(54) Title: MULTIPLEX VGID

(57) Abstract 

The present invention relates generally to the field of
genomics. More particularly, the present invention relates
to a method for gene identification beginning with
user-selected input phenotypes. The method is referred to
generally as the ValiGene℠ Gene Identification method, or the
VGID℠ method. When more than two source populations of
nucleic acids are simultaneously compared, the method may be
referred to as multiplex VGID℠. The method employs nucleic
acid mismatch binding protein chromatography to effect a
molecular comparison of one phenotype with others. Genes are
identified as having a specified function, or as causing or
contributing to the cause or pathogenesis of a specified
disease, or as associated with a specific phenotype, by
virtue of their selection by the method. Identified genes
may be used in development of reagents, drugs and/or
combinations thereof useful in clinical or other settings for
prognosis, diagnosis and/or treatment of diseases, disorders
and/or conditions. The method is equally suited for gene
identification for agricultural, bio-engineering, medical,
veterinary, and many other applications.
MULTIPLEX VGID

This application is a continuation-in-part of U.S.
Patent Application Serial No. 09/007,905 (Attorney Docket No.
9408-003) entitled "METHOD FOR IDENTIFYING GENES UNDERLYING
DEFINED PHENOTYPES" filed January 15, 1998, which is
incorporated herein by reference in its entirety.

1. FIELD OF THE INVENTION 

The present invention relates generally to the field of
genomics. More particularly, the present invention relates
to a method for gene identification beginning with
user-selected input phenotypes. The method is referred to
generally as the ValiGene℠ Gene Identification method, or the
VGID℠ method. The method employs nucleic acid mismatch
binding protein chromatography to effect a molecular
comparison of one phenotype with others. Genes are
identified as having a specified function, or as causing or
contributing to the cause or pathogenesis of a specified
disease, or as associated with a specific phenotype, by
virtue of their selection by the method. Identified genes
may be used in development of reagents, drugs and/or
combinations thereof useful in clinical or other settings for
prognosis, diagnosis and/or treatment of diseases, disorders
and/or conditions. The method is equally suited for gene
identification for agricultural, bio-engineering, medical,
veterinary, and many other applications. When more than two
source populations of nucleic acids are simultaneously
compared, the method may be referred to as multiplex VGID℠.

2. BACKGROUND OF THE INVENTION

Identification of a particular genotype responsible for
a given phenotype is an essential goal underlying gene-based
medicine because it affords a rational departure point for
the development of successful strategies for disease
management, therapy and even cure. While, by one recent
estimate, only two percent (2%) of the human genome has yet
been sequenced, perhaps more than 50% of expressed human
genes are at least partially represented in existing
databases (Duboule, D., October 24, 1997, Editorial: The
Evolution Of Genomics, Science 278, 555). It is therefore
quite clear that understanding functional interactions among
the products of expressed genes represents the next great
challenge in medicine and biology. This pursuit has been
referred to as "functional genomics," although this term is
perhaps too broad to have a clear meaning (Hieter, P. and
Boguski, M., October 24, 1997, Functional Genomics: It's All
How You Read It, Science 278, 601-602). Nevertheless, it is
the prevailing view that functional genomics generally
describes ". . . a transition or expansion from the mapping
and sequencing of genomes . . . to an emphasis on genome
function." (Id.). Further, this new emphasis will require
". . . creative thinking in developing innovative technologies
that make use of the vast resource of structural genomics
information." Perhaps the best definition of functional
genomics is ". . . the development and application of global
(genome-wide or system-wide) experimental approaches to
assess gene function by making use of the information
provided by structural genomics." (Id., emphasis added).

One of the major advantages of the present invention is
the circumvention of large-scale sequencing in determining
functional relationships among genes. The VGID℠ method of
the present invention is a straightforward yet very powerful
genetic comparison or subtraction technique. Functional
information is obtained from global (i.e. genome-wide)
expressed gene comparison of two or more user-defined
phenotypes using mismatch binding protein chromatography.
With the VGID℠ method, disease genes may be identified over a
time period of weeks, unlike the years required to succeed
using positional cloning.

2.1. CHARACTERISTICS OF DISEASE AND OTHER
PHENOTYPES

Genetic diseases and other genetically-determined
phenotypes, irrespective of mode of inheritance, can be due
to single or multiple lesions (i.e. mutations) affecting one
gene or more than one gene simultaneously. Genetic
heterogeneity (i.e. a difference in DNA sequence), by
definition, characterizes all diseases which have a genetic
component. Genetic diseases can be further categorized among
four broad genotypic groups, as described below.

A mono-allelic disease is characterized as having a
mutation in a single allele of a single gene. This disease
group is the simplest in terms of genetic analysis since
mono-allelic diseases arise, by definition, from a unique
lesion affecting a single gene. Mono-allelic diseases have
also been described as displaying "molecular monomorphism,"
which is another way of saying that a single molecular defect
in a single gene accounts for the disease phenotype. Since
such genetic lesions are unique, they are invariably
"causative" of the disease in question. For a mono-allelic
disease, only a few affected individuals need to undergo
genetic analysis to attribute a given mutation to a disease
phenotype. That is, large familial studies are not required
to identify the disease-causing gene. Only a few examples of
such diseases are known. One example is sickle cell anemia,
which is due to a single base substitution (i.e. A -> T) in
the gene encoding hemoglobin. This base substitution changes
the respective codon from GAG to GTG, ultimately resulting in
a glutamate-to-valine amino acid substitution at position six
of the hemoglobin β chain molecule and the characteristic,
devastating sickle-shaped erythrocyte.
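
The codon change described above can be illustrated with a
short sketch. The codon table below is deliberately limited to
the two codons named in the text, and the helper name
substitute_base is hypothetical, introduced only for this
illustration.

```python
# Only the two codons discussed in the text are listed here; this is
# not a complete genetic code table.
CODON_TABLE = {"GAG": "Glu", "GTG": "Val"}


def substitute_base(codon: str, position: int, base: str) -> str:
    """Return a copy of codon with the base at 0-based position replaced."""
    return codon[:position] + base + codon[position + 1:]


normal_codon = "GAG"
# The A -> T substitution falls on the middle base of the codon.
mutant_codon = substitute_base(normal_codon, 1, "T")

print(normal_codon, CODON_TABLE[normal_codon], "->",
      mutant_codon, CODON_TABLE[mutant_codon])
```

A single-base change is therefore enough to swap the encoded
amino acid from glutamate to valine.
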
A polyallelic disease is characterized as having several
different mutations arising independently in a single gene.
Here, each independent mutation event gives rise to a
different disease allele. A significant proportion of all
genetic disease is thought to result in this way. Because
such de novo mutations are so frequent, polyallelism is a
very common characteristic of genetic disease. Duchenne's
muscular dystrophy (DMD), Becker's myopathy, and cystic
fibrosis (CF) are well-known examples of polyallelic diseases
(see e.g. McKusick, Mendelian Inheritance in Man, Catalog of
Autosomal Dominant, Autosomal Recessive, and X-Linked
Phenotypes, 10th Edition, 1992, The Johns Hopkins University
Press, Baltimore, Maryland). Polyallelism may arise in at
least two ways. First, each new case of a disease may arise
from an independent mutation event in the target gene. For
example, in DMD, at least 30% of cases present novel
mutations in the dystrophin gene which differ from all
previously-characterized mutations. Second, selective
fixation of different founder-effect mutations contributes to
the occurrence of polyallelism. One example of this is the
β-thalassemias, in which the world population of affected
individuals presents remarkably high polyallelism, but local
populations are characterized by limited allelic
heterogeneity.

Non-allelic genetic disease is characterized as having
more than one candidate gene. Here, a genetic disease which
is clinically well-defined may be due to a lesion (mutation)
of any one gene among several candidate genes. For example,
imperfect osteogenesis is caused by lesion of any one of five
distinct type 1 collagen genes. However, the identification
of candidate genes for a non-allelic genetic disease is made
more difficult when the several candidate genes, unlike the
collagen genes, are not related in sequence. For example,
pituitary dwarfism is physiologically due to hypofunction of
the anterior pituitary gland. In a minority of pituitary
dwarfism cases, the causative lesion has been traced to the
gene complex elaborating growth hormone (Kaplan and Delpech,
1993, in Molecular Biology and Medicine, 2nd ed.,
Médecine-Sciences Flammarion, Paris, Chap. 12, pp. 307-308).
In the vast majority of cases, however, these genes are
perfectly normal and the causative disease loci are not even
linked to the growth hormone complex (as demonstrated by
polymorphism linkage studies, Id.). Therefore, other
unidentified genes comprising alleles not related to growth
hormone account for the majority of pituitary dwarfism cases.
Such non-allelic diseases clearly require more than just
linkage analysis to identify all of the involved genes. The
VGID℠ method of the present invention provides a rapid,
rational way of approaching this problem.

A polygenic disease is characterized as having several 
abnormal genes acting concurrently to produce a pathologic 
phenotype. This group includes many genetic diseases often 
described as "multifactorial disorders." Examples include 
diabetes mellitus, hypertension, atherosclerosis, autoimmune 
disorders, and many others. For the majority of polygenic 
diseases, the metabolic complexities are so great that a 
rational basis on which candidate genes could be identified 
may not have existed before the invention set forth herein. 
In the few instances where a candidate gene has been 
suggested, this knowledge has still proven largely inadequate 
to identify susceptible individuals, or to explain 
pathogenesis. 

The last two groups of genetic disorders described above
(i.e. non-allelism and polygenism) represent the greatest
challenge currently facing human and veterinary medicine.
Because of an absence of sufficient biochemical and
physiological data, credible candidate genes have largely
gone unidentified. This absence of credible candidate genes
has, in turn, ruled out the possibility of identifying
susceptible individuals and attempting preventive
intervention before symptoms appear. The invention set forth
herein provides one way to overcome these limitations by
identifying credible candidate genes.

2.2. GENE IDENTIFICATION BY POSITIONAL CLONING

There are several known methods available to identify
candidate disease genes, and to further select genes among
identified candidates, which are systematically associated
with a given pathology. These include various methods for
differential expression analysis (e.g. differential display,
serial analysis of gene expression or SAGE), and positional
cloning methods. Among positional cloning approaches, the
initial steps are quite similar or identical; most often, it
is only the final steps that differ (see e.g. Rommens et al.,
1989, Science 245, 1059-1065; Duyk et al., 1990, Proc. Natl.
Acad. Sci. U.S.A. 87, 8995-8999). The major drawbacks of
positional cloning methods generally include: (a) the slow
pace of discovery, often requiring several years for success;
(b) the high complexity of the techniques involved, requiring
highly-trained individuals who must pay painstaking attention
to detail to get satisfactory results; (c) the
labor-intensive nature of the techniques, often requiring
enormous amounts of sequencing; and (d) the extreme expense
associated with any slow, complex, labor-intensive effort.
Positional cloning can be considered as four discrete steps
which are well-known in the art. Each of these steps is
briefly described below.


2.2.1. LINKAGE MAPPING

The first step in using positional cloning for disease
gene identification consists of a search for genetic linkage
between a locus implicated in pathogenesis and a number of
genotypic polymorphic markers. This step requires
segregation analysis in affected families. Linkage mapping
takes advantage of the fact that the closer two genetic loci
are to each other, the smaller the chance that an independent
recombination event will separate them. Therefore, the aim
is to find a specific fragment of genomic DNA bordered by two
known markers systematically present in all affected members
of a family, but rarely present in the unaffected members.
If such a genomic fragment can be identified, the pathogenic
locus will be found located between the markers.
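
The relationship between proximity and recombination can be
made concrete with a map function. The sketch below uses
Haldane's map function, which is not named in this
specification and is offered only as an illustrative
assumption; it presumes that crossovers occur independently
(no interference). The function name is hypothetical.

```python
import math


def haldane_recombination_fraction(distance_morgans: float) -> float:
    """Expected recombination fraction between two loci.

    Haldane's map function (an assumption for illustration):
    r = 0.5 * (1 - exp(-2d)), with d the map distance in Morgans.
    """
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))


# Closer loci recombine less often; distant loci approach r = 0.5,
# which is indistinguishable from independent assortment.
for d in (0.01, 0.10, 0.50, 2.00):
    r = haldane_recombination_fraction(d)
    print(f"distance {d:.2f} Morgans -> recombination fraction {r:.4f}")
```

Under this model a marker that almost never separates from the
disease locus must lie very close to it, which is the property
that linkage mapping exploits.
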

Linkage mapping presents difficulties that vary
according to the mode of inheritance of a disease. In an
ideal linkage map, all bearers of an abnormal gene will be
identified. In the case of an autosomal dominant disease,
this is only theoretically possible if: (a) all bearers show
the diseased phenotype (i.e. penetrance is complete); and (b)
disease manifestation is precocious. In the case of
autosomal recessive disorders, it is only possible to detect
the homozygotes (all affected) and the obligate heterozygotes
(the parents). It is therefore essential to have access to
families where there are at least two living, homozygous
affected siblings when mapping an autosomal recessive
disorder.

In a few lucky cases of linkage map construction and
analysis, specific chromosomes can be easily ruled out as
carrying the diseased gene of interest. In these rare
instances, the gene search quickly becomes more focused. For
example, DMD is a recessive disorder which is very rare in
females. As a result, the search for the DMD gene could
safely be limited to the X chromosome. However, in the
majority of cases, such a simplified approach is not at all
available. A case in point is CF, where it took five years
of intensive effort just to identify the chromosome
associated with the disease.



2.2.2. CHROMOSOMAL LOCALIZATION 

The genomic fragment identified in the preceding step is
often very large (i.e. several million bases) and entirely
unknown in terms of the number and identity of genes it
encodes. Therefore, it is often essential to localize the
genomic fragment to a specific chromosome in order to take
advantage of other known markers which may not yet be
associated with the fragment. Chromosomal localization may
be carried out by utilization of polymorphic markers (e.g.
microsatellites) identified on genomic DNA or large genomic
fragments cloned into yeast artificial chromosomes (YACs)
that have been assigned to specific human chromosomes.
Chromosomal localization may also be effected by
fluorescently labeling a large (e.g. 100 kilobase) identified
genomic fragment for hybridization and karyotype analysis
(Dauwerse et al., 1992, Hum. Mol. Genet. 1, 593-598).

2.2.3. FURTHER REFINEMENT

Once the identified genomic fragment has been localized
to a specific chromosome, the largest possible number of
polymorphic markers is used to bracket the smallest possible
region (i.e. locus) encoding the gene of interest. This step
can yield genomic fragments that are still very large, i.e.
one-half to one million bases long. Since the average length
of a gene is on the order of seventy thousand bases, such a
region is very likely to encode many different genes.
Furthermore, this approach does not allow one to distinguish
between monogenic and polygenic disorders. If an apparent
lack of genetic heterogeneity cannot be clinically
determined, then the actual degree of heterogeneity must be
assessed by systematic comparison of different families. In
this very frequent case, the results from each family must be
analyzed separately to determine whether they are consistent
with a "single locus" hypothesis. This is a complex problem
since genetic heterogeneity may be clinically undetectable
(e.g. pituitary dwarfism, see above). Alternatively,
apparent clinical heterogeneity may lead to the erroneous
conclusion that different genes are involved when, in fact,
different allelic forms of the same gene are involved (e.g.
DMD and Becker's myopathy, see above).
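
The scale of the problem can be estimated from the two figures
quoted above (a locus of one-half to one million bases and an
average gene length on the order of seventy thousand bases).
The sketch below is a back-of-the-envelope calculation only;
it assumes, purely for illustration, that genes of average
length lie end to end with no intergenic spacing, and the
function name is hypothetical.

```python
AVERAGE_GENE_LENGTH = 70_000  # bases, the figure quoted in the text


def whole_genes_in_region(region_length: int,
                          gene_length: int = AVERAGE_GENE_LENGTH) -> int:
    """Number of whole average-length genes that fit end to end in a region."""
    return region_length // gene_length


low = whole_genes_in_region(500_000)     # one-half million bases
high = whole_genes_in_region(1_000_000)  # one million bases
print(f"A refined locus could span roughly {low} to {high} genes")
```

Even after refinement, then, the locus may contain on the order
of ten candidate genes, only one of which need be the gene of
interest.
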


2.2.4. FROM LOCUS TO GENE

Having defined a genetic locus for a disease-associated
gene using the above methods, there is much work left to be
done before the gene itself is ultimately identified. The
identification problem encompasses two major difficulties.
First, it is necessary to generate new markers for further
map refinement. The new markers must be located as close as
possible to, and ultimately in, the gene concerned. Second,
it is necessary to demonstrate that the identified gene is
actually responsible for the disease. These two tasks
require the utilization, in parallel, of a wide variety of
methods. Two of the most commonly followed approaches are
briefly described below.

Exon trapping involves the cloning of short fragments
generated from an entire identified locus into retroviral
vectors which have been engineered to reveal the presence of
exons (i.e. coding sequences) within a short fragment. Any
positive clones (i.e. clones containing an exon) function as
new markers and must next be sequenced and mapped back to the
locus in order to define the relative position of each. The
exon trapping approach is enormously labor-intensive in that
it requires massive amounts of DNA sequencing and produces a
substantial number of false positives and false negatives.
Of course, the exon map generated includes exons from any
gene within the locus and is not specific to exons from the
disease gene of interest. Accordingly, further work is
required.

Complementary DNA (cDNA) subtraction assays utilize cDNA
libraries constructed from cells of an affected individual
and from cells of a healthy individual. The procedure has
two successive phases. In phase one, the cDNA inserts from
the healthy individual are immobilized on a membrane and used
to trap (subtract) the homologous cDNA inserts present in the
affected individual's library. In phase two, the procedure
is inverted: i.e. the cDNA inserts from the library of the
affected individual are immobilized and used to subtract
homologous inserts from the healthy cDNA library. Therefore,
these two phases yield cDNA fragments that are entirely
unique to the affected or to the healthy individual,
respectively. Any fragment homologous (similar but not
identical) to a sequence present in the immobilized library
remains trapped. Accordingly, this approach often results in
a complete loss of the gene of interest.
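
The loss described above can be seen in a toy model of the two
subtraction phases. Everything in the sketch is an assumption
made for illustration: the sequences are invented, "homologous"
is modeled as equal length with at most one mismatched base,
and the function names are hypothetical.

```python
def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def is_trapped(insert: str, immobilized: list[str], max_mismatches: int = 1) -> bool:
    """An insert is trapped if it is identical or merely similar to any
    immobilized sequence (toy stand-in for hybridization)."""
    return any(
        len(insert) == len(target)
        and count_mismatches(insert, target) <= max_mismatches
        for target in immobilized
    )


def subtract(library: list[str], immobilized: list[str]) -> list[str]:
    """Return the inserts of library that escape the immobilized library."""
    return [s for s in library if not is_trapped(s, immobilized)]


# Invented sequences: the first affected insert is a one-base variant
# (GAG -> GTG) of the first healthy insert.
healthy = ["ATGGAGCTT", "CCGTTAGGA"]
affected = ["ATGGTGCTT", "CCGTTAGGA", "TTTACCGGA"]

phase_one = subtract(affected, healthy)  # healthy library immobilized
phase_two = subtract(healthy, affected)  # affected library immobilized
print(phase_one, phase_two)
```

In this model the point-mutated insert never appears in either
output, because it is similar enough to its healthy counterpart
to be trapped along with it.
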

Clones obtained by the exon trapping or cDNA subtraction
approaches are then used for direct hybridization to:
(a) yeast artificial chromosome overlapping segments (YAC
contigs) covering the locus of interest; (b) mRNA
preparations obtained from affected and healthy individuals;
and/or (c) enriched genomic libraries obtained from the same
affected and healthy individuals. Any positive hybridization
signals are then further analyzed by sequencing.

At the last step in positional cloning, i.e. gene
identification, one is often confronted with results that
cannot precisely pinpoint the relevant gene. In this
instance, the only approach remaining is to entirely sequence
and analyze the smallest genomic region of the defined locus,
which may still range from 300 to 700 kilobases. The
problematic nature of positional cloning for disease gene
identification is further highlighted below in noting a few
of the realities associated with the approach.

Positional cloning projects are so labor intensive that
they have been undertaken, in most instances, only by large
consortia of international research groups comprising at
least three laboratories per consortium. Each laboratory of
such a consortium, in turn, is typically composed of five or
more researchers devoting essentially all of their time and
effort to the project. For example, identification of the CF
gene took a total of eight years, finding the gene for
polycystic kidney disease type 1 (PKD1) took six years, and
finding the ataxia-telangiectasia gene took over five years.
Many other examples could be recited, and many positional
cloning efforts have yet to identify the target gene.
Notably, these are all monogenic diseases, i.e. only one
gene is responsible for the disease and it is the same gene
in all cases of the disease.

The difficulties are amplified in the context of
polygenic or multifactorial disorders. Here, very little
progress has been made in gene identification. For example,
after over fifteen years of intensive searching by a
considerable number of research teams, the genetic causes of
diabetes mellitus (type I and type II) remain largely
unknown. The same can be said for chronic renal failure
(CRF), multiple sclerosis (MS), atherosclerosis, and many
others. This list names only a few of the most prevalent
polygenic or multifactorial disorders.

One of the major reasons for this state of affairs is
that, in the absence of any information allowing the testing
of likely candidate genes, it is necessary to first map the
loci associated with the disorder to specific chromosomal
regions before having a chance of isolating the genes
concerned by positional cloning (see above). Of course, it
would be considerably simpler to forego mapping entirely and
work from mRNA transcripts of genes expressed in affected
tissues. However, this approach has proven virtually
impossible using past methods. This is due, at least in
part, to the fact that tissues and cells express a great many
genes. Furthermore, genes associated with pathologies are
often expressed at very low levels. Therefore, the few
relevant disease mRNA transcripts may be lost among an
enormous number of other transcripts. Still further adding
to the identification problem, the disease transcripts may
differ widely among affected individuals. These intrinsic
shortcomings of past positional and subtraction methodologies
are such that very small quantities of mRNA cannot be used.

The VGID℠ method for gene identification set forth
herein provides a simple solution to this enormous problem.
It allows one to identify phenotype-associated genes, in
monogenic as well as polygenic contexts, in a matter of weeks
rather than years and at greatly reduced expense.

2.3. MISMATCH REPAIR 

DNA mismatch repair genes comprise one of several
mechanisms by which high fidelity DNA replication is
maintained in cells under physiologic conditions. Many
investigators over the years have manipulated one or more of
these genes to achieve various ends. First described in
bacteria, the mismatch repair system comes into play when the
product of the MutS gene recognizes and binds to a mispaired
base pair (see Cox, E.G., 1997, MutS, Proofreading And
Cancer, Genetics 146, 443-446). MutS works in concert with
the products of the MutH and MutL genes; these three proteins
together form the so-called MutHLS mismatch repair system. A
recent review has provided a detailed description of this
system in eukaryotes (see Kolodner, R., 1996, Biochemistry
And Genetics Of Eukaryotic Mismatch Repair, Genes Dev. 10,
1433-1442).

Hereditary nonpolyposis colon cancer (HNPCC) arises from
mutations in the hMSH2 gene, the human homolog of the
bacterial MutS gene, as shown by two laboratories in 1993
(see Fishel, R. et al., 1993, The Human Mutator Gene Homolog
MSH2 And Its Association With Hereditary Nonpolyposis Colon
Cancer, Cell 75, 1027-1038; Leach, F.S. et al., 1993,
Mutations Of A MutS Homolog In Hereditary Nonpolyposis
Colorectal Cancer, Cell 75, 1215-1225). The human MSH2
protein also functions via binding to DNA mismatches (Fishel,
R. et al., 1994, Binding Of Mismatched Microsatellite DNA
Sequences By The Human MSH2 Protein, Science 266, 1403-1405;
Fishel, R. et al., 1994, Purified Human MSH2 Protein Binds To
DNA Containing Mismatched Nucleotides, Cancer Res. 54,
5539-5542). Another human homolog of bacterial MutS has
recently been linked to cancer susceptibility (Edelman, W. et
al., November 14, 1997, Mutation In The Mismatch Repair Gene
Msh6 Causes Cancer Susceptibility, Cell 91, 467-477).

Traditionally, manipulation of the mismatch repair
system has been employed in a variety of ways. For example,
a method for in vitro recombination of mismatches has been
described which takes advantage of MutS-deficient E. coli
(Resnick, M.A. and Radman, M., August 2, 1994, System For
Isolating And Producing New Genes, Gene Products And DNA
Sequences, U.S. Patent No. 5,334,522). Others have described
using the MutS protein to detect DNA mismatches in vitro with
antibodies (Wagner, R.E., Jr. and Radman, M., April 2, 1997,
Method For Detection Of Mutations, European Patent EP 0 596
028 B1). Still others have used the inability of the system
to repair loops of five nucleotides or greater in vivo to
design a system capable of detecting a single mismatch in a
DNA fragment as large as 10 kilobases (see Faham, M. and Cox,
D.R., 1997, A Novel in vivo Method To Detect DNA Sequence
Variation, Genome Research 5, 474-482).

3. SUMMARY OF THE INVENTION

This invention provides a method for identifying a gene
or allele, or several genes or alleles, underlying a
phenotype-of-interest. In this regard, genes or alleles are
identified as having a specified function, or as causing or
contributing to the cause or pathogenesis of a specified
disease, or as associated with a specific phenotype, by
virtue of their selection by the method.

This invention is based, at least in part, on the
recognition that comparison of a population of nucleic acid
molecules with one or more other populations of nucleic acid
molecules, so as to isolate genes underlying specific
phenotypic traits, is greatly facilitated by first taking
steps to ensure internal homogenization of one or more of the
populations to be compared before performing the external
comparison of two or more populations. In this regard,
internal homogenization is effected by a first round of
hybridization and sorting of matched from mismatched DNA
duplexes. Similarly, external comparison is effected by a
second round of hybridization and sorting of matched from
mismatched DNA duplexes, as described in detail hereinbelow.

This invention provides a method for identifying one or
more genes underlying a defined phenotype comprising the
following steps in the order stated: (a) removing mismatched
duplex nucleic acid molecules formed from hybridization
within each of two source populations of nucleic acids; and
(b) retaining mismatched duplex nucleic acid molecules formed
from hybridization between the two source populations, the
retained molecules in step (b) comprising the one or more
genes underlying the defined phenotype.
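
The logic of steps (a) and (b) can be sketched as a toy
computational model. This is not the laboratory procedure and
is not taken from the specification: it assumes, for
illustration only, that each source population is a mapping
from a gene name to the set of allele sequences pooled from its
individuals, that duplexes form only between alleles of the
same gene, and that a gene with more than one allele in a pool
is removed entirely in step (a), which is coarser than
chromatography would be. All gene names, sequences, and
function names are hypothetical.

```python
def homogenize(population: dict[str, set[str]]) -> dict[str, str]:
    """Step (a), toy version: discard genes that would form mismatched
    duplexes within the pool, i.e. genes with more than one allele."""
    return {
        gene: next(iter(alleles))
        for gene, alleles in population.items()
        if len(alleles) == 1
    }


def retain_mismatched(first: dict[str, str], second: dict[str, str]) -> list[str]:
    """Step (b), toy version: keep genes whose cross-population duplex
    is mismatched, i.e. whose single alleles differ between pools."""
    shared = first.keys() & second.keys()
    return sorted(gene for gene in shared if first[gene] != second[gene])


# Invented example pools.
affected = {
    "geneA": {"ATGGTG"},            # differs from the unaffected allele
    "geneB": {"CCGTTA"},            # identical in both pools
    "geneC": {"GGATCC", "GGATCA"},  # polymorphic within the affected pool
}
unaffected = {
    "geneA": {"ATGGAG"},
    "geneB": {"CCGTTA"},
    "geneC": {"GGATCC"},
}

candidates = retain_mismatched(homogenize(affected), homogenize(unaffected))
print(candidates)
```

In this model the within-pool polymorphism of geneC is removed
before the comparison, so only the difference that
distinguishes the two pools survives as a candidate.
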

Further, this invention provides a method for
identifying one or more genes underlying a defined phenotype
comprising the following steps in the order stated: (a)
removing mismatched duplex nucleic acid molecules formed from
hybridization within a first source population of nucleic
acids; and (b) retaining mismatched duplex nucleic acid
molecules formed from hybridization between the first source
population and a second source population of nucleic acids,
the retained molecules in step (b) comprising the one or more
genes underlying the defined phenotype.

Nucleic acid sample populations may be derived from many 
different sources. In one embodiment, the first and second
source populations each are nucleic acid populations derived 
from at least two individuals having consanguinity. In 
another embodiment, the first and second source populations 
each are nucleic acid populations derived from more than two 
individuals having consanguinity. In one embodiment, the 
first and second source populations each are nucleic acid 
populations derived from two to six individuals having 
consanguinity. In another embodiment, the first and second 
source populations each are nucleic acid populations derived 
from three individuals having consanguinity. In still 
another embodiment, each source population is a cell line. 

Further, nucleic acid sample populations may be 
manipulated in various ways so as to facilitate gene 
identification. In one embodiment, the source populations 
are normalized cDNA libraries to facilitate identification of 
rare transcripts. In another embodiment, the source 
populations are linearized cDNA libraries to facilitate 
hybridization. In still another embodiment, the source 
populations are normalized and linearized. 

Still further, nucleic acid sample populations may be 
manipulated in various ways so as to facilitate removal of 
undesired cDNAs. In one embodiment, the two source 
populations are of DNA, the DNA of a source population is 
labeled, and the hybridization in step (b) is carried out 
using an excess of labeled DNA. In another embodiment, the 
excess of labeled DNA is a three-fold excess. 

Genes underlying virtually any defined phenotype may be
identified using the method of the invention. In a preferred
embodiment, the defined phenotype is selected from the group
consisting of a plant resistance phenotype, a microorganism
resistance phenotype, cancer, osteoporosis, obesity, type II
diabetes, and a prion-related disease. Additional examples
of preferred defined phenotypes follow immediately below.

Defined plant phenotypes include but are not limited to 
resistance to herbicides, resistance to insect predators, 
resistance to fungal infections, increased yields, resistance 
to frost, resistance to dehydration, enhanced stem strength, 
and many others. 

Defined microorganism phenotypes include but are not 
limited to susceptibility or resistance to antibiotics, 
detoxification of liquids, soils, solids, and/or gases 
contaminated by pollutants or toxic compounds (e.g. dioxin, 
nitrous oxides, carbon monoxide, sulfur dioxide, free
radicals, and so on).

Defined animal and/or veterinary phenotypes include but
are not limited to resistance to neurological disorders such
as prion-related diseases, infectious disorders (e.g. porcine
plague), foot-and-mouth disease, and many others.

Defined human phenotypes include but are not limited to 
susceptibility to cancer, autoimmune diseases, neurological 
disorders, metabolic disorders (e.g. diabetes, obesity), 
systemic diseases (e.g. osteoporosis), and many others. 

This invention provides a method for identifying one or 
more genes underlying a defined phenotype displayed by a cell 
or individual from which a first cDNA library is derived, but 
not displayed by a cell or individual from which a second 
cDNA library is derived. The method comprises the steps of 
(a) hybridizing insert DNA from the first cDNA library with 
itself, (b) hybridizing insert DNA from the second cDNA 
library with itself, (c) contacting the DNA hybridized in 
step (a) with a first immobilized mismatch binding protein, 
(d) contacting the DNA hybridized in step (b) with a second 
immobilized mismatch binding protein, (e) separating unbound 
DNA from bound DNA contacted in step (c), (f) separating
unbound DNA from bound DNA contacted in step (d), (g)
labeling unbound DNA separated in step (f) with a label
capable of binding a partner molecule or agent immobilized on
a substrate, (h) hybridizing labeled DNA with unbound DNA
separated in step (e), (i) contacting DNA hybridized in step
(h) with a third immobilized mismatch binding protein, (j)
separating unbound DNA from bound DNA contacted in step (i),
(k) contacting unbound DNA separated in step (j) with the
partner molecule or agent immobilized on the substrate
capable of binding the label, and (l) separating unbound DNA
from bound DNA contacted in step (k), which unbound DNA
separated in step (l) encodes one or more identified genes
underlying the defined phenotype.
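
The selection logic of steps (a) through (l) above can be illustrated with a small toy simulation. This is only a sketch under stated assumptions and not part of the disclosed method: the names `Strand`, `hybridize`, `mismatch_column` and `identify_genes` are hypothetical, the mismatch binding protein is idealized as binding every duplex whose two strands differ, and the labeled DNA is assumed to be in sufficient excess that every first-library strand with a labeled counterpart pairs with it.

```python
from typing import NamedTuple


class Strand(NamedTuple):
    gene: str              # hypothetical gene identifier
    allele: str            # hypothetical sequence variant of that gene
    labeled: bool = False  # True once the strand carries the capture label


def hybridize(pool_x, pool_y):
    """Form every duplex between strands of the same gene."""
    return [(s, t) for s in pool_x for t in pool_y if s.gene == t.gene]


def mismatch_column(duplexes):
    """Idealized mismatch binding protein: returns (bound, unbound)."""
    bound = [d for d in duplexes if d[0].allele != d[1].allele]
    unbound = [d for d in duplexes if d[0].allele == d[1].allele]
    return bound, unbound


def strands_of(duplexes):
    return {s for d in duplexes for s in d}


def identify_genes(first_library, second_library):
    # Steps (a), (c), (e): self-hybridize the first library, keep unbound DNA.
    _, first_unbound = mismatch_column(hybridize(first_library, first_library))
    # Steps (b), (d), (f): the same for the second library.
    _, second_unbound = mismatch_column(hybridize(second_library, second_library))
    first = strands_of(first_unbound)
    # Step (g): label the unbound DNA of the second library.
    labeled = {s._replace(labeled=True) for s in strands_of(second_unbound)}
    # Step (h): ASSUMPTION - labeled DNA is in excess, so first-library strands
    # reanneal with each other only when no labeled counterpart exists.
    labeled_genes = {s.gene for s in labeled}
    first_only = {s for s in first if s.gene not in labeled_genes}
    mixed = (hybridize(first, labeled)
             + hybridize(labeled, labeled)
             + hybridize(first_only, first_only))
    # Steps (i), (j): third mismatch binding step, keep unbound DNA.
    _, unbound = mismatch_column(mixed)
    # Steps (k), (l): the immobilized partner captures any labeled duplex.
    flow_through = [d for d in unbound if not (d[0].labeled or d[1].labeled)]
    return {s.gene for s in strands_of(flow_through)}


first = {Strand("geneA", "v1"), Strand("geneB", "v1"), Strand("geneC", "v1")}
second = {Strand("geneA", "v1"), Strand("geneB", "v2")}
print(identify_genes(first, second))  # {'geneC'}
```

In this toy run, geneA is removed as a perfectly matched labeled hybrid, geneB is removed as a mismatched hybrid, and only geneC, which is absent from the second library, remains in the final unbound fraction.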

Further, this invention provides a method for 
identifying one or more genes underlying a defined phenotype 
from organisms having consanguinity. The method comprises 
the steps of (a) hybridizing insert DNA from a first 
collection of cDNA libraries derived from organisms having
the defined phenotype with itself, (b) contacting DNA
hybridized in step (a) with a first immobilized mismatch
binding protein, (c) separating unbound DNA from bound DNA
contacted in step (b), (d) labeling unbound DNA separated in
step (c) with a label capable of binding a partner molecule
or agent immobilized on a substrate, (e) hybridizing DNA
labeled in step (d) with insert DNA from a second collection
of cDNA libraries derived from organisms not having the
defined phenotype, (f) contacting DNA hybridized in step (e)
with a second immobilized mismatch binding protein, (g)
separating unbound DNA from bound DNA contacted in step (f),
(h) contacting unbound DNA separated in step (g) with the
partner molecule or agent immobilized on the substrate
capable of binding the label, and (i) separating unbound DNA
from bound DNA contacted in step (h), which unbound DNA
separated in step (i) encodes identified genes underlying the
defined phenotype. This paragraph sets forth a preferred
embodiment in which the DNA labeled in step (d) corresponds
to undesired material labeled for removal.

Still further, this invention provides a method for 
identifying one or more alleles underlying a defined
phenotype displayed by a cell or individual from which a
first cDNA library is derived, but not displayed by a cell or
individual from which a second cDNA library is derived. The
method comprises the steps of (a) hybridizing insert DNA from
the first cDNA library with itself, (b) hybridizing insert
DNA from the second cDNA library with itself, (c) contacting
the DNA hybridized in step (a) with a first immobilized
mismatch binding protein, (d) contacting the DNA hybridized
in step (b) with a second immobilized mismatch binding
protein, (e) separating unbound DNA from bound DNA contacted
in step (c), (f) separating unbound DNA from bound DNA
contacted in step (d), (g) labeling unbound DNA separated in
step (f) with a label capable of binding a partner molecule
or agent immobilized on a substrate, (h) hybridizing DNA
labeled in step (g) with unbound DNA separated in step (e),
(i) contacting DNA hybridized in step (h) with a third
immobilized mismatch binding protein, (j) separating unbound
DNA from bound DNA contacted in step (i), (k) releasing bound
DNA separated in step (j) from the third immobilized mismatch
binding protein, (l) contacting DNA released in step (k) with
the partner molecule or agent immobilized on the substrate
capable of binding the label, (m) denaturing DNA contacted in
step (l), and (n) separating unbound DNA from bound DNA
denatured in step (m), which unbound DNA separated in step
(n) encodes one or more identified alleles underlying the
defined phenotype.
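
The allele-selecting variant in steps (a) through (n) above keeps the mismatched (bound) fraction rather than the unbound one. The following toy model is a sketch only, with hypothetical names (`Strand`, `hybridize`, `mismatch_column`, `identify_alleles`). It assumes an idealized mismatch binding protein that binds every duplex whose strands differ, and it assumes that duplexes not captured by the immobilized partner are washed away before denaturation, a step the text does not spell out.

```python
from typing import NamedTuple


class Strand(NamedTuple):
    gene: str              # hypothetical gene identifier
    allele: str            # hypothetical sequence variant of that gene
    labeled: bool = False  # True once the strand carries the capture label


def hybridize(pool_x, pool_y):
    """Form every duplex between strands of the same gene."""
    return [(s, t) for s in pool_x for t in pool_y if s.gene == t.gene]


def mismatch_column(duplexes):
    """Idealized mismatch binding protein: returns (bound, unbound)."""
    bound = [d for d in duplexes if d[0].allele != d[1].allele]
    unbound = [d for d in duplexes if d[0].allele == d[1].allele]
    return bound, unbound


def identify_alleles(first_library, second_library):
    # Steps (a)-(f): remove duplexes that are mismatched within each library.
    _, first_unbound = mismatch_column(hybridize(first_library, first_library))
    _, second_unbound = mismatch_column(hybridize(second_library, second_library))
    first = {s for d in first_unbound for s in d}
    # Step (g): label the unbound DNA of the second library.
    labeled = {s._replace(labeled=True) for d in second_unbound for s in d}
    # Steps (h)-(j): hybridize both pools together, retain the bound fraction.
    pool = first | labeled
    bound, _ = mismatch_column(hybridize(pool, pool))
    # Step (k): release the bound duplexes from the mismatch binding protein.
    released = bound
    # Step (l): the immobilized partner captures duplexes with a labeled strand.
    # ASSUMPTION: uncaptured duplexes are washed away at this point.
    captured = [d for d in released if d[0].labeled or d[1].labeled]
    # Steps (m), (n): after denaturation the unlabeled strand is unbound.
    return {(s.gene, s.allele) for d in captured for s in d if not s.labeled}


first = {Strand("geneA", "v1"), Strand("geneB", "v1"), Strand("geneC", "v1")}
second = {Strand("geneA", "v1"), Strand("geneB", "v2")}
print(identify_alleles(first, second))  # {('geneB', 'v1')}
```

Here only geneB forms a mismatched hybrid between the two libraries, so the unlabeled geneB strand from the first library is the single allele recovered.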

Yet still further, this invention provides a method for 
identifying one or more alleles underlying a defined 
phenotype from organisms having consanguinity. The method 
comprises the steps of (a) hybridizing insert DNA from a 
first collection of cDNA libraries derived from organisms 
having the defined phenotype with itself, (b) contacting DNA 
hybridized in step (a) with a first immobilized mismatch 
binding protein, (c) separating unbound DNA from bound DNA 
contacted in step (b), (d) labeling unbound DNA separated in
step (c) with a label capable of binding a partner molecule
or agent immobilized on a substrate, (e) hybridizing DNA
labeled in step (d) with insert DNA from a second collection
of cDNA libraries derived from organisms not having the
defined phenotype, (f) contacting DNA hybridized in step (e)
with a second immobilized mismatch binding protein, (g)
separating unbound DNA from bound DNA contacted in step (f),
(h) releasing bound DNA separated in step (g) from the second
immobilized mismatch binding protein, (i) contacting DNA
released in step (h) with the partner molecule or agent
immobilized on the substrate capable of binding the label,
(j) denaturing DNA contacted in step (i), and (k) separating
bound DNA from unbound DNA denatured in step (j), which bound
DNA separated in step (k) encodes one or more identified
alleles underlying the defined phenotype.

The cDNA library collections will vary according to the 
specific attributes of the sample source. In one embodiment,
the first and second cDNA library collections each are
nucleic acid populations derived from at least two
individuals having consanguinity. In another embodiment, the
first and second cDNA library collections each are nucleic
acid populations derived from more than two individuals
having consanguinity. In one embodiment, the first and
second cDNA library collections each are nucleic acid
populations derived from two to six individuals having
consanguinity. In another embodiment, the first and second
cDNA library collections each are nucleic acid populations
derived from three individuals having consanguinity.

A nucleic acid sample population may be left unlabeled 
or labeled with a unique label in various ways. In one 
embodiment, labeling is effected by polymerase chain reaction 
using a 5'-biotinylated primer. In another embodiment,
labeling is effected by polymerase chain reaction using a
5'-peptide-labeled primer. In a preferred embodiment, labeling
using a 5'-biotinylated primer is performed when using one
unlabeled sample population and one labeled sample
population. In another preferred embodiment, labeling using
a 5'-peptide-labeled primer is performed when multiplexing,
i.e. when using three or more nucleic acid sample 
populations.

A labeled nucleic acid sample population may be sorted
in various ways. In one embodiment, the substrate for
binding the biotin label is streptavidin. In another
embodiment, the substrate for binding the peptide label is an
antibody. In still another embodiment, the antibody is an
anti-peptide antibody. In yet still another embodiment, the
anti-peptide antibody is monoclonal.

A variety of wild-type and recombinant, engineered
mismatch binding proteins may be used to effect sorting (i.e.
binding and release) of DNA duplexes containing mismatches.
In one embodiment, the mismatch binding protein is E. coli
MutS. In another embodiment, the mismatch binding protein is
hMSH2. In still another embodiment, the mismatch binding
protein is an hMSH2-hMSH6 protein complex.

This invention provides a method for identifying one or 
more genes underlying a defined phenotype displayed by a cell
or individual from which a first cDNA library is derived, but
not displayed by a cell or individual from which a second
cDNA library is derived. The method comprises the steps of
(a) amplifying insert DNA from the first cDNA library by
polymerase chain reaction, (b) amplifying insert DNA from the
second cDNA library by polymerase chain reaction, (c)
hybridizing DNA amplified in step (a) with itself, (d)
hybridizing DNA amplified in step (b) with itself, (e)
contacting DNA hybridized in step (c) with a first
immobilized MutS, (f) contacting DNA hybridized in step (d)
with a second immobilized MutS, (g) separating unbound DNA
from bound DNA contacted in step (e), (h) separating unbound
DNA from bound DNA contacted in step (f), (i) amplifying
unbound DNA separated in step (g) by polymerase chain
reaction using unlabeled primers, (j) amplifying and labeling
unbound DNA separated in step (h) by polymerase chain
reaction using 5'-biotinylated primers, (k) hybridizing DNA
amplified and labeled in step (j) with DNA amplified in step
(i), (l) contacting DNA hybridized in step (k) with a third
immobilized MutS, (m) separating unbound DNA from bound DNA
contacted in step (l), (n) contacting unbound DNA separated
in step (m) with immobilized streptavidin, and (o) separating
unbound DNA from bound DNA contacted in step (n), which
unbound DNA separated in step (o) encodes one or more
identified genes underlying the defined phenotype.

Further, this invention provides a method for
identifying one or more genes underlying a disease phenotype
from healthy and affected individuals having consanguinity.
The method comprises the steps of (a) amplifying insert DNA
from a first collection of cDNA libraries derived from
affected individuals by polymerase chain reaction, (b)
hybridizing DNA amplified in step (a) with itself, (c)
contacting DNA hybridized in step (b) with a first
immobilized MutS, (d) separating unbound DNA from bound DNA
contacted in step (c), (e) amplifying and labeling unbound
DNA separated in step (d) by polymerase chain reaction using
5'-biotinylated primers, (f) amplifying insert DNA from a
second collection of cDNA libraries derived from healthy
individuals by polymerase chain reaction, (g) hybridizing DNA
amplified and labeled in step (e) with DNA amplified in step
(f), (h) contacting DNA hybridized in step (g) with a second
immobilized MutS, (i) separating unbound DNA from bound DNA
contacted in step (h), (j) contacting unbound DNA separated
in step (i) with immobilized streptavidin, and (k) separating
unbound DNA from bound DNA contacted in step (j), which
unbound DNA separated in step (k) encodes one or more
identified genes underlying the disease phenotype.

Still further, this invention provides a method for 
identifying one or more alleles underlying a defined 
phenotype displayed by a cell or individual from which a 
first cDNA library is derived, but not displayed by a cell or 
individual from which a second cDNA library is derived. The 
method comprises the steps of (a) amplifying insert DNA from 
the first cDNA library by polymerase chain reaction, (b) 
amplifying insert DNA from the second cDNA library by 
polymerase chain reaction, (c) hybridizing DNA amplified in 
step (a) with itself, (d) hybridizing DNA amplified in step
(b) with itself, (e) contacting DNA hybridized in step (c)
with a first immobilized MutS, (f) contacting DNA hybridized
in step (d) with a second immobilized MutS, (g) separating
unbound DNA from bound DNA contacted in step (e), (h)
separating unbound DNA from bound DNA contacted in step (f),
(i) amplifying unbound DNA separated in step (g) by
polymerase chain reaction using unlabeled primers, (j)
amplifying and labeling unbound DNA separated in step (h) by
polymerase chain reaction using 5'-biotinylated primers, (k)
hybridizing DNA amplified and labeled in step (j) with DNA
amplified in step (i), (l) contacting DNA hybridized in step
(k) with a third immobilized MutS, (m) separating unbound DNA
from bound DNA contacted in step (l), (n) releasing bound DNA
separated in step (m) from the third immobilized MutS, (o)
contacting DNA released in step (n) with immobilized
streptavidin, (p) denaturing DNA contacted in step (o), and
(q) separating unbound DNA from bound DNA denatured in step
(p), which unbound DNA separated in step (q) encodes one or
more identified alleles underlying the defined phenotype. In
one embodiment, releasing bound DNA from the third
immobilized MutS in step (n) is carried out using ATP or
proteinase K.

Yet still further, this invention provides a method for 
identifying one or more affected alleles underlying a disease
phenotype from healthy and affected individuals having
consanguinity. The method comprises the steps of (a)
amplifying insert DNA from a first collection of cDNA
libraries derived from affected individuals by polymerase
chain reaction, (b) hybridizing DNA amplified in step (a)
with itself, (c) contacting DNA hybridized in step (b) with a
first immobilized MutS, (d) separating unbound DNA from bound
DNA contacted in step (c), (e) amplifying and labeling
unbound DNA separated in step (d) by polymerase chain
reaction using 5'-biotinylated primers, (f) amplifying insert
DNA from a second collection of cDNA libraries derived from
healthy individuals by polymerase chain reaction, (g)
hybridizing DNA amplified and labeled in step (e) with DNA
amplified in step (f), (h) contacting DNA hybridized in step
(g) with a second immobilized MutS, (i) separating unbound
DNA from bound DNA contacted in step (h), (j) releasing bound
DNA separated in step (i) from the second immobilized MutS,
(k) contacting DNA released in step (j) with immobilized
streptavidin, (l) denaturing DNA contacted in step (k), and
(m) separating bound DNA from unbound DNA denatured in step
(l), which bound DNA separated in step (m) encodes one or
more identified affected alleles underlying the disease
phenotype. In one embodiment, releasing bound DNA from the
second immobilized MutS in step (j) is carried out using ATP
or proteinase K.

Yet still further, this invention provides a method for
identifying one or more genes underlying a defined phenotype
displayed by a cell or individual from which a first cDNA
library is derived, but not displayed by a cell or individual
from which a plurality of additional cDNA libraries is
derived. The method comprises the steps of (a) hybridizing
insert DNA from each cDNA library with itself, (b) contacting
each separate population of DNA hybridized in step (a)
individually with an immobilized mismatch binding protein,
(c) separating unbound DNA from bound DNA contacted
individually in step (b), (d) labeling each separate
population of unbound DNA separated in step (c) with a
different label capable of binding a partner molecule
immobilized on a substrate, (e) hybridizing DNA separately
labeled in step (d), (f) contacting DNA hybridized in step
(e) with an immobilized mismatch binding protein, and (g)
separating unbound DNA from bound DNA contacted in step (f).

Still further, this invention provides a method for
identifying one or more genes underlying a defined phenotype
displayed by a cell or individual from which a first cDNA
library is derived, but not displayed by a cell or individual
from which a plurality of additional cDNA libraries is
derived. The method comprises the steps of (a) amplifying
insert DNA from each cDNA library by polymerase chain
reaction, (b) hybridizing each separate population of DNA
amplified in step (a) with itself, (c) contacting each
separate population of DNA hybridized in step (b)
individually with immobilized MutS, (d) separating unbound
DNA from bound DNA contacted in step (c), (e) labeling each
separate population of unbound DNA separated in step (d) by
polymerase chain reaction using a distinct 5'-peptide-labeled
primer capable of binding a partner molecule immobilized on a
substrate, (f) hybridizing DNA labeled in step (e), (g)
contacting DNA hybridized in step (f) with immobilized MutS,
and (h) separating unbound DNA from bound DNA contacted in
step (g).

Further, this invention provides a method for 
identifying one or more alleles underlying a defined 
phenotype displayed by a cell or individual from which a 
first cDNA library is derived, but not displayed by a cell or 
individual from which a plurality of additional cDNA 
libraries is derived. The method comprises the steps of (a) 
hybridizing insert DNA from each cDNA library with itself, 
(b) contacting each separate population of DNA hybridized in 
step (a) individually with an immobilized mismatch binding 
protein, (c) separating unbound DNA from bound DNA contacted 
in step (b), (d) labeling each separate population of unbound
DNA separated in step (c) with a distinct label capable of
binding a partner molecule immobilized on a substrate, (e)
hybridizing DNA labeled in step (d), (f) contacting DNA
hybridized in step (e) with an immobilized mismatch binding
protein, and (g) separating unbound DNA from bound DNA
contacted in step (f).

Still further, this invention provides a method for 
identifying one or more alleles underlying a defined
phenotype displayed by a cell or individual from which a
first cDNA library is derived, but not displayed by a cell or
individual from which a plurality of additional cDNA
libraries is derived. The method comprises the steps of (a)
amplifying insert DNA from each cDNA library by polymerase
chain reaction, (b) hybridizing DNA amplified from each
library in step (a) with itself, (c) contacting DNA from each
library hybridized in step (b) individually with an
immobilized mismatch binding protein, (d) separating unbound
DNA from bound DNA contacted in step (c), (e) amplifying and
labeling each separate population of unbound DNA separated in
step (d) by polymerase chain reaction using a distinct
5'-peptide-labeled primer, (f) hybridizing DNA amplified and
labeled in step (e), (g) contacting DNA hybridized in step
(f) with an immobilized mismatch binding protein, (h)
separating unbound DNA from bound DNA contacted in step (g),
(i) releasing bound DNA separated in step (h), and (j)
separating DNA released in step (i) into single strands.

Still further, this invention provides a method for 
identifying one or more alleles underlying a defined 
phenotype comprising the following steps in the order stated:
(a) removing mismatched duplex nucleic acid molecules formed
from hybridization within each of a plurality of source
populations of nucleic acids; (b) retaining mismatched duplex
nucleic acid molecules formed from hybridization among the
plurality of source populations; (c) separating mismatched
strands retained in step (b), which separated strands
comprise one or more alleles underlying the defined
phenotype.

This invention provides a method for identifying one or 
more genes underlying a defined phenotype. The method 
comprises the steps of (a) removing mismatched duplex nucleic 
acid molecules formed from hybridization within each of a 
plurality of source populations of nucleic acids, and
(b) retaining mismatched duplex nucleic acid molecules formed
from hybridization among the plurality of source populations,
the retained molecules in step (b) comprising the one or more
genes underlying the defined phenotype. In one embodiment,
the plurality of source populations comprises at least one
normalized cDNA library. In another embodiment, the
plurality of source populations comprises at least one
linearized cDNA library. In yet another embodiment, the
plurality of source populations consists of DNA, the DNA of
each of the source populations being labeled with a different
label, and the hybridization in step (b) is carried out using
an excess of labeled DNA from one or more source populations.
In one embodiment, the excess of labeled DNA is a three-fold
excess. Yet in another embodiment, each of the source
populations is derived from a cell line.

This invention also provides a method for identifying
one or more genes underlying a defined phenotype displayed by 
a cell or individual from which a first cDNA library is 
derived, but not displayed by a cell or individual from which 
a plurality of additional cDNA libraries is derived. The 
method comprises the steps of (a) hybridizing insert DNA from 
the first cDNA library with itself, (b) hybridizing insert 
DNA from each library of the plurality of additional cDNA 
libraries with itself, (c) contacting the DNA hybridized in 
step (a) with an immobilized mismatch binding protein, 
(d) contacting each separate population of DNAs hybridized in 
step (b) individually with an immobilized mismatch binding 
protein, (e) separating unbound DNA from bound DNA contacted
in step (c), (f) separating unbound DNA from bound DNA
contacted individually in step (d), (g) labeling each
separate population of the unbound DNA separated in step (f)
with a distinguishable label capable of binding a partner
molecule immobilized on a substrate, (h) hybridizing DNA
separately labeled in step (g) with unbound DNA separated in
step (e), (i) contacting DNA hybridized in step (h) with an
immobilized mismatch binding protein, (j) separating unbound
DNA from bound DNA contacted in step (i), (k) contacting
unbound DNA separated in step (j) with the partner molecule
of each different label, and (l) separating unbound DNA from
bound DNA contacted in step (k), which unbound DNA separated
in step (l) encodes one or more identified genes underlying
the defined phenotype. In one embodiment, one or more of the
cDNA libraries is normalized. In another embodiment, one or
more of the cDNA libraries is linearized. In yet another
embodiment, labeling is carried out by polymerase chain
reaction using a 5'-peptide-labeled primer. In yet another
embodiment, at least one partner molecule immobilized is an
antibody. In still another embodiment, the antibody is an
anti-peptide antibody. In yet another embodiment, the
hybridization in step (h) is carried out using an excess of
labeled DNA. In yet another embodiment, the excess of
labeled DNA is a three-fold excess. In yet another
embodiment, an immobilized mismatch binding protein is MutS.
In one embodiment, the defined phenotype is selected from the
group consisting of a plant phenotype, a microorganism
phenotype, and a pathologic phenotype. In another embodiment,
the defined phenotype is a pathologic phenotype that is
selected from the group consisting of cancer, osteoporosis,
obesity, type II diabetes, and a prion-related disease.
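
The multiplex form of the gene-identification steps, in which each additional library carries its own distinguishable label, can be sketched as a toy model. The sketch is illustrative only: `Strand`, `clean`, `hybridize` and `multiplex_identify_genes` are hypothetical names, the label strings stand in for distinct labels such as peptide tags, the mismatch binding protein is idealized as binding every duplex whose strands differ, and the labeled DNA is assumed to be in excess.

```python
from typing import NamedTuple, Optional


class Strand(NamedTuple):
    gene: str                    # hypothetical gene identifier
    allele: str                  # hypothetical sequence variant of that gene
    label: Optional[str] = None  # name of the distinguishable label, if any


def hybridize(pool_x, pool_y):
    """Form every duplex between strands of the same gene."""
    return [(s, t) for s in pool_x for t in pool_y if s.gene == t.gene]


def unbound(duplexes):
    """Duplexes NOT retained by an idealized mismatch binding protein."""
    return [d for d in duplexes if d[0].allele == d[1].allele]


def clean(library):
    """Self-hybridize one library and recover strands from unbound duplexes."""
    return {s for d in unbound(hybridize(library, library)) for s in d}


def multiplex_identify_genes(first_library, additional_libraries):
    """additional_libraries maps a distinguishable label name to a library."""
    first = clean(first_library)                       # steps (a), (c), (e)
    labeled = set()
    for label, library in additional_libraries.items():
        # Steps (b), (d), (f), (g): clean each library, then label it.
        labeled |= {s._replace(label=label) for s in clean(library)}
    # Step (h): ASSUMPTION - labeled DNA is in excess, so first-library
    # strands reanneal with each other only if no labeled counterpart exists.
    labeled_genes = {s.gene for s in labeled}
    first_only = {s for s in first if s.gene not in labeled_genes}
    mixed = (hybridize(first, labeled)
             + hybridize(labeled, labeled)
             + hybridize(first_only, first_only))
    remaining = unbound(mixed)                         # steps (i), (j)
    # Steps (k), (l): one immobilized partner molecule per label.
    partners = set(additional_libraries)
    flow_through = [d for d in remaining
                    if not any(s.label in partners for s in d)]
    return {s.gene for d in flow_through for s in d}


first = {Strand(g, "v1") for g in ("geneA", "geneB", "geneC", "geneD")}
others = {
    "tag1": {Strand("geneA", "v1"), Strand("geneB", "v2")},
    "tag2": {Strand("geneC", "v1")},
}
print(multiplex_identify_genes(first, others))  # {'geneD'}
```

Only geneD, which appears in none of the labeled libraries, survives both the mismatch binding step and the label capture step in this toy example.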

This invention further provides a method for identifying 
one or more genes underlying a defined phenotype displayed by 
a cell or individual from which a first cDNA library is 
derived, but not displayed by a cell or individual from which 
a plurality of additional cDNA libraries is derived. The 
method comprises the steps of (a) amplifying insert DNA from 
the first cDNA library by polymerase chain reaction, (b) 
amplifying insert DNA from each of the plurality of 
additional cDNA libraries by polymerase chain reaction,
(c) hybridizing DNA amplified in step (a) with itself,
(d) hybridizing each separate population of DNA amplified in
step (b) with itself, (e) contacting DNA hybridized in step
(c) with immobilized MutS, (f) contacting each separate
population of DNA hybridized in step (d) individually with
immobilized MutS, (g) separating unbound DNA from bound DNA
contacted in step (e), (h) separating unbound DNA from bound
DNA contacted in step (f), (i) labeling unbound DNA separated
in step (g) by polymerase chain reaction using unlabeled
primers, (j) labeling each separate population of unbound DNA
separated in step (h) by polymerase chain reaction using a
primer having a distinguishable 5'-peptide-label capable of
binding a partner molecule immobilized on a substrate, (k)
hybridizing DNA labeled in step (i) with DNA labeled in step
(j), (l) contacting DNA hybridized in step (k) with
immobilized MutS, (m) separating unbound DNA from bound DNA
contacted in step (l), (n) contacting unbound DNA separated
in step (m) with one or more partner molecules capable of
binding the distinguishable 5'-peptide-labeled primers, and
(o) separating unbound DNA from bound DNA contacted in step
(n), which unbound DNA separated in step (o) encodes one or
more identified genes underlying the defined phenotype.

This invention provides a method for identifying one or
more alleles underlying a defined phenotype displayed by a
cell or individual from which a first cDNA library is
derived, but not displayed by a cell or individual from which
a plurality of additional cDNA libraries is derived. The
method comprises the steps of (a) hybridizing insert DNA from
the first cDNA library with itself, (b) hybridizing insert
DNA from each of the plurality of additional cDNA libraries
with itself, (c) contacting DNA hybridized in step (a) with
an immobilized mismatch binding protein, (d) contacting each
separate population of DNA hybridized in step (b)
individually with an immobilized mismatch binding protein,
(e) separating unbound DNA from bound DNA contacted in step
(c), (f) separating unbound DNA from bound DNA contacted in
step (d), (g) labeling each separate population of unbound
DNA separated in step (f) with a distinguishable label
capable of binding a partner molecule immobilized on a
substrate, (h) hybridizing DNA labeled in step (g) with
unbound DNA separated in step (e), (i) contacting DNA
hybridized in step (h) with an immobilized mismatch binding
protein, (j) separating unbound DNA from bound DNA contacted
in step (i), (k) releasing bound DNA separated in step (j)
from the immobilized mismatch binding protein, (l) contacting
DNA released in step (k) with one or more partner molecules
capable of binding the distinct labels, (m) denaturing DNA
contacted in step (l), and (n) separating unbound DNA from
bound DNA denatured in step (m), which unbound DNA separated
in step (n) encodes one or more identified alleles underlying
the defined phenotype. In one embodiment, at least one cDNA
library is normalized. In another embodiment, at least one
cDNA library is linearized. In one embodiment, labeling is
carried out by polymerase chain reaction using
5'-peptide-labeled primers. In another embodiment, at least one
immobilized partner molecule is an antibody. In another
embodiment, the antibody is an anti-peptide antibody. In
another embodiment, the hybridization in step (h) is carried
out using an excess of labeled DNA. In another embodiment,
the excess of labeled DNA is a three-fold excess. In another
embodiment, at least one of the immobilized mismatch binding
proteins is MutS.

This invention provides a method for identifying one or 
more alleles underlying a defined phenotype displayed by a 
cell or individual from which a first cDNA library is 
derived, but not displayed by a cell or individual from which 
a plurality of additional cDNA libraries is derived. The 
method comprises the steps of (a) amplifying insert DNA from 
the first cDNA library by polymerase chain reaction, (b)
amplifying insert DNA from each of the plurality of
additional cDNA libraries by polymerase chain reaction,
(c) hybridizing DNA amplified in step (a) with itself,
(d) hybridizing DNA amplified from each library in step (b)
with itself, (e) contacting DNA hybridized in step (c) with
immobilized MutS, (f) contacting each population of DNA
hybridized in step (d) individually with immobilized MutS,
(g) separating unbound DNA from bound DNA contacted in step
(e), (h) separating unbound DNA from bound DNA contacted in
step (f), (i) amplifying unbound DNA separated in step (g) by
polymerase chain reaction using unlabeled primers, (j)
amplifying and labeling each population of unbound DNA
separated in step (h) by polymerase chain reaction using a
distinguishable 5'-peptide-labeled primer, (k) hybridizing
DNA amplified and labeled in step (j) with DNA amplified in
step (i), (l) contacting DNA hybridized in step (k) with
immobilized MutS, (m) separating unbound DNA from bound DNA
contacted in step (l), (n) releasing bound DNA separated in
step (m) from immobilized MutS, (o) contacting DNA released
in step (n) with one or more immobilized antibodies specific
for each distinguishable 5'-peptide-labeled primer, (p)
denaturing DNA contacted in step (o), and (q) separating
unbound DNA from bound DNA denatured in step (p),
which unbound DNA separated in step (q) encodes one or more
identified alleles underlying the defined phenotype. In one
embodiment, releasing bound DNA from immobilized MutS in step
(n) is carried out using ATP or proteinase K. In another
embodiment, the method further comprises a step of using the
one or more genes or alleles identified to carry out a
prognosis or a diagnosis. In one embodiment, the one or more
genes or alleles identified, or an encoded protein thereof,
is a target for drug intervention. In another embodiment,
the plurality of source populations is in the range of three
to twelve source populations. In yet another embodiment, the
plurality of source populations is in the range of three to
six source populations. In another embodiment, the plurality
of source populations consists of four source populations.

This invention provides a method for identifying one or 
more genes underlying a defined phenotype displayed by a cell 
or individual from which a first cDNA library is derived, but 
not displayed by a cell or individual from which a plurality 
of additional cDNA libraries is derived. The method 
comprises the steps of (a) hybridizing insert DNA from each
cDNA library with itself, (b) contacting each separate
population of DNA hybridized in step (a) individually with an
immobilized mismatch binding protein, (c) separating unbound
DNA from bound DNA contacted individually in step (b),
(d) labeling each separate population of unbound DNA
separated in step (c) with a distinguishable label capable of
binding a partner molecule immobilized on a substrate,
(e) hybridizing DNA separately labeled in step (d),
(f) contacting DNA hybridized in step (e) with an immobilized
mismatch binding protein, and (g) separating unbound DNA from
bound DNA contacted in step (f).

This invention provides a method for identifying one or
more genes underlying a defined phenotype displayed by a cell
or individual from which a first cDNA library is derived, but
not displayed by a cell or individual from which a plurality
of additional cDNA libraries is derived. The method
comprises the steps of (a) amplifying insert DNA from each
cDNA library by polymerase chain reaction, (b) hybridizing
each separate population of DNA amplified in step (a) with
itself, (c) contacting each separate population of DNA
hybridized in step (b) individually with immobilized MutS,
(d) separating unbound DNA from bound DNA contacted in step
(c), (e) labeling each separate population of unbound DNA
separated in step (d) by polymerase chain reaction
using a primer having a distinguishable 5'-peptide-label
capable of binding a partner molecule immobilized on a
substrate, (f) hybridizing DNA labeled in step (e), (g)
contacting DNA hybridized in step (f) with immobilized MutS,
and (h) separating unbound DNA from bound DNA contacted in
step (g).

This invention provides a method for identifying one or 
more alleles underlying a defined phenotype displayed by a 
cell or individual from which a first cDNA library is 
derived, but not displayed by a cell or individual from which 
a plurality of additional cDNA libraries is derived. The 
method comprises the steps of (a) hybridizing insert DNA from 
each cDNA library with itself, (b) contacting each separate
population of DNA hybridized in step (a) individually with an
immobilized mismatch binding protein, (c) separating unbound
DNA from bound DNA contacted in step (b), (d) labeling each
separate population of unbound DNA separated in step (c) with
a distinguishable label capable of binding a partner molecule
immobilized on a substrate, (e) hybridizing DNA labeled in
step (d), (f) contacting DNA hybridized in step (e) with an
immobilized mismatch binding protein, and (g) separating
unbound DNA from bound DNA contacted in step (f).

This invention provides a method for identifying one or 
more alleles underlying a defined phenotype displayed by a 
cell or individual from which a first cDNA library is
derived, but not displayed by a cell or individual from which
a plurality of additional cDNA libraries is derived. The
method comprises the steps of (a) amplifying insert DNA from
each cDNA library by polymerase chain reaction, (b)
hybridizing DNA amplified from each library in step (a) with
itself, (c) contacting DNA from each library hybridized in
step (b) individually with an immobilized mismatch binding
protein, (d) separating unbound DNA from bound DNA contacted
in step (c), (e) amplifying and labeling each separate
population of unbound DNA separated in step (d) by polymerase
chain reaction using a distinct 5'-peptide-labeled primer,
(f) hybridizing DNA amplified and labeled in step (e),
(g) contacting DNA hybridized in step (f) with an immobilized
mismatch binding protein, (h) separating unbound DNA from
bound DNA contacted in step (g), (i) releasing bound DNA
separated in step (h), and (j) separating DNA released in
step (i) into single strands.

This invention provides a method for identifying one or 
more alleles underlying a defined phenotype. The method 
comprises the steps of (a) removing mismatched duplex nucleic 
acid molecules formed from hybridization within each of a 
plurality of source populations of nucleic acids, (b) 
retaining mismatched duplex nucleic acid molecules formed 
from hybridization among the plurality of source populations, 
and (c) separating mismatched strands retained in step (b) , 
which separated strands comprise one or more alleles 
underlying the defined phenotype. 
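
By way of illustration only, the logic of steps (a) through (c) can be sketched as a toy Python model. The sketch is not part of the described method and rests on stated assumptions: each source population is represented as a dictionary mapping a gene name to the list of sequence copies observed in that source; removal of within-source mismatched duplexes is approximated by keeping the most common sequence of each gene; and a between-source mismatch is approximated as a gene whose retained sequence differs between sources. The function names `homogenize` and `retain_between_source_mismatches` are hypothetical.

```python
from collections import Counter


def homogenize(population):
    """Step (a), approximated: keep the most common sequence of each gene.

    `population` maps gene name -> list of observed sequences (hypothetical
    representation). Minority variants stand in for the strands that would
    be removed as within-source mismatched duplexes.
    """
    return {gene: Counter(seqs).most_common(1)[0][0]
            for gene, seqs in population.items() if seqs}


def retain_between_source_mismatches(homogenized):
    """Steps (b)-(c), approximated: report genes whose retained sequences
    differ between sources, together with the sequence from each source.

    `homogenized` maps source name -> output of `homogenize`.
    """
    if not homogenized:
        return {}
    shared_genes = set.intersection(*(set(h) for h in homogenized.values()))
    retained = {}
    for gene in sorted(shared_genes):
        by_source = {name: h[gene] for name, h in homogenized.items()}
        if len(set(by_source.values())) > 1:  # at least one mismatch
            retained[gene] = by_source
    return retained


sources = {
    "with_phenotype": {"geneA": ["ACGT", "ACGT", "ACGA"],
                       "geneB": ["TTGA", "TTGA", "TTGA"]},
    "control_1": {"geneA": ["ACCT", "ACCT", "ACCT"],
                  "geneB": ["TTGA", "TTGA", "TTGC"]},
    "control_2": {"geneA": ["ACCT", "ACCT"],
                  "geneB": ["TTGA", "TTGA"]},
}
homogenized = {name: homogenize(pop) for name, pop in sources.items()}
candidates = retain_between_source_mismatches(homogenized)
```

In this example three source populations are compared at once, in the manner of the multiplex embodiment; only `geneA` is reported, because the minority variants within each source are discarded before the sources are compared.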



4. BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic representation of a VGID℠ approach
for phenotype samples obtained from sources without at least
one common ancestor (e.g. cell line samples; healthy and
diseased nodes within an individual tissue sample). PCR is
polymerase chain reaction.

FIG. 2 is a schematic representation of a VGID℠ approach
for phenotype samples obtained from sources having at least
one common ancestor (e.g. tissue samples from healthy and
disease-affected siblings).

FIG. 3 is a flow chart representation of the phenotype
selection process to be employed prior to using the VGID℠
method of the invention.

FIG. 4 is a schematic map of five hDinP clones isolated
using the VGID℠ method and cell line samples as input
phenotypes. The VGID℠ approach employed is that illustrated
in FIG. 1. A lymphoblast cell line was chosen as cell line # 1
because it expresses a specific alteration in a DNA repair
pathway (i.e. "with phenotype" in FIG. 1); a hepatocyte cell
line was chosen as cell line # 2 (i.e. "without phenotype" in
FIG. 1).



FIG. 5A-B shows BLASTX search results and computer
analysis for the hDinP clone listed in SEQ ID NO:1 (#1).

FIG. 6 shows BLASTX search results and computer analysis
for the hDinP clone listed in SEQ ID NO:2 (Tor-M).

FIG. 7A-B shows BLASTX search results and computer
analysis for the hDinP clone listed in SEQ ID NO:3 (#3).

FIG. 8A-B shows BLASTX search results and computer
analysis for the hDinP clone listed in SEQ ID NO:4 (*1).

FIG. 9A-B shows BLASTX search results and computer
analysis for the hDinP clone listed in SEQ ID NO:5 (*2).

5. DETAILED DESCRIPTION OF THE INVENTION 

The present invention provides a method, referred to
generally as the ValiGene℠ Gene Identification method, or the
VGID℠ method, for identification of a gene or multiple genes
linked to a user-specified phenotype. In this regard, genes
linked to a phenotype include genes which cause the
phenotype-of-interest, genes which merely contribute to a
phenotype which is partly due to genetic factors and partly
due to environmental factors, as well as structurally altered
genes arising as an effect of a phenotype. The methodology
comprising VGID℠ can be used to perform a function-based
analysis of the protein-coding genome of any organism
irrespective of biological kingdom. Further, the VGID℠
method can simultaneously identify multiple alleles of the
gene of interest which are associated with multiple
phenotypes, including disease phenotypes. Accordingly,
phenotype-specific diagnostic tools are provided by genes
identified using the VGID℠ method. In particular, such
diagnostic tools may be used as an indication of the presence
of the phenotype-of-interest. Further, phenotype-specific
prognostic tools are provided by genes identified using the
VGID℠ method; such prognostic tools may be used to indicate
or predict a disease course and/or outcome for various
disease phenotypes.

The VGID℠ methodology is based on a constant underlying
principle, i.e. the ability to specifically trap and
subsequently release mismatched artificial cDNA hybrids
formed by annealing interactions between cDNAs originating
from phenotypically-distinct sources. Thus, the VGID℠
methodology is a powerful molecular comparison tool which
does not require global sequence information. Instead, a
comparison among phenotypic groups is accomplished using cDNA
annealing interactions and subsequent sorting of matched from
mismatched hybrids.




WO 99/36575 PCT/US99/01037

The details of the VGID℠ method vary depending upon the
precise "comparison" to be made. This is due to the fact
that mRNA transcripts derived from different sources will
vary in their "complexity" (i.e. genetic heterogeneity) and
therefore should undergo slightly different processing
approaches, as described in detail below. In one embodiment,
the VGID℠ method is used to isolate transcripts that are
identical among phenotypically-distinct groups. In another
embodiment, the VGID℠ method is used to isolate transcripts
that are different among such groups. The VGID℠ method is
broadly applicable for identifying the genes underlying
specific functions. Common input phenotypes for use with the
VGID℠ method are healthy (normal) and affected (disease)
phenotypes. Other common input phenotypes are susceptible
and resistant phenotypes (e.g. viruses susceptible and
resistant to antiviral agents, microbes susceptible and
resistant to antibiotics, plants susceptible and resistant to
herbicides, insects susceptible and resistant to
insecticides). In this regard, one skilled in the art will
recognize that the VGID℠ method may be applied virtually
anywhere two or more input phenotypes are identified,
regardless of biological kingdom. Guidelines for input
phenotype selection are provided in Section 5.4 hereinbelow.

The VGID℠ method utilizes nucleic acids obtained or
derived from at least two source groups as starting material.
In a preferred embodiment, the nucleic acid is cDNA made from
messenger RNA (mRNA), preferably total poly A RNA from the
source groups. Small quantities of mRNA are sufficient for
using the VGID℠ method. This flexibility in input amount
permits a meaningful genetic analysis of rare disease tissue
samples, for example. The lower limit amount of source
nucleic acid required is the minimal amount sufficient for
construction of a cDNA library (i.e. about 1 ng to 1 per
source with most cDNA library construction techniques).

At its most basic level, the VGID℠ method may be thought
of as an expressed gene subtraction technique. The VGID℠
method is based upon two rounds of highly efficient mismatch
binding protein chromatography for trapping (e.g. by binding
to immobilized MutS) of: (a) internally heterologous nucleic
acids (round one; see upper columns in FIGs. 1 and 2); and
(b) externally heterologous nucleic acids (round two; see
lower columns in FIGs. 1 and 2), as described below. In this
regard, internally heterologous nucleic acids refers to
heterologous nucleic acids (i.e. nucleic acids that do not
have identical counterparts) within each of two or more
source groups, and externally heterologous nucleic acids
refers to heterologous nucleic acids between the source
groups. In the first round, it is generally the untrapped
material from input phenotypes which is of primary interest
(such untrapped material is said to be "homogenized"). By
contrast, in the second round, the material of interest is
often the trapped material. This trapped material must
necessarily be an artificially-formed, hybrid duplex of
similar, yet non-identical cDNA strands, one strand
originating from material left untrapped in the first round
subtraction step. For use in the VGID℠ method, nucleic acids
are obtained from at least two sources. Best results are
obtained where most of the nucleic acids are structurally
identical between different sources since this will result in
the most effective subtraction in the second round. This
situation is most likely to arise where the input source
groups are phenotypically identical but for the
phenotype-of-interest. Accordingly, the choice of input sources
ultimately determines whether the expressed gene-of-interest
is identified. Often, the most appropriate samples are
obtained from large families containing several affected and
unaffected individuals. In the context of positional
cloning, the reasons behind this were explained above. In
the VGID℠ method, families (particularly families where
consanguinity, i.e. relationship by a common ancestor, is
known to exist) may also provide the most appropriate
samples, but for entirely different reasons.

Consanguinity gives rise to the direct, non-recombined
inheritance of genetic elements which, alone or in
association with other factors, can have pathogenic effects.
This property of consanguinity can be turned into a
considerable advantage in the search for genes directly
associated with pathologies. In the presence of
consanguinity, it would be expected that all diseased
individuals taken from three generations within the same
disease-transmitting family (or otherwise inbred) will carry
the same disease-causing locus, and hence be identical-by-
descent at this locus.


5.1. GENETIC HETEROGENEITY 

The various cell lines which are available for a given 
cell type (e.g. hepatocytes) are characterized by functional
similarities and differences (i.e. phenotype) as well as by
structural genomic similarities and differences (i.e.
genotype). In each cell line, the phenotype arises from a
unique source, i.e. the expressed genes of that cell line. 
Samples of a given tissue originating from different 
individuals are also characterized by phenotype and genotype. 
However, in tissue samples, the phenotype arises from the 
aggregate contributions of the expressed genes of several 
different cell types and these contributions cannot be 
individually isolated. 

The consequences of the above for gene identification 
according to the present invention are two-fold. First, 
tissues are most useful for isolating genes linked to a 
broadly-defined phenotype, such as the presence of a disorder
affecting individual A but not individual B. Tissue samples
are less useful for isolation of unknown genes associated
with a narrowly-defined phenotype. Second, cell lines are
most useful for isolating genes linked to a very
clearly-defined molecular function (e.g. a particular form of
DNA repair such as that performed by hDinP; see below). The
specific methods described below to isolate unknown genes
from tissues and cell lines are therefore different.

To elaborate further, it is useful to compare function
and genetic makeup in tissues and cell lines.

With regard to function, note that all cells of a cell 
line population are clonal in origin. That is, they not only 
descend from a single unique cell, they are actual copies of 
the ancestor cell. All cells of a cell line are therefore
functionally identical. By contrast, a tissue sample is
composed of many different cell types carrying out different
functions, and each given cell type population is made up of
groups of functionally similar cells having different
lineages.

With regard to genetic makeup, different cell lines are 
of completely different origin. That is, the ancestor cells 
which gave rise to different cell lines came from different 
individuals. Cell lines therefore carry entirely different 
genomes. By contrast, the various cell types comprising a 
given tissue all share the same genome, irrespective of the 
functional differences among the cell types within the 
tissue. 

Members of a given cell line population initially
present very high internal consanguinity and very high
functional identity. However, due to fast growth rate in
artificial conditions, little or no selective pressures and
no possibility to eradicate aberrant cells (i.e. no immune
system), members are free to accumulate mutations and
transmit them to their direct progeny (so long as these
mutations do not compromise basic metabolism). Therefore, a
cell line population potentially carries a wide variety of
newly-acquired mutations. This not only reduces the
structural genomic homogeneity of the population, but also
allows different members of the population to express
different forms of a given gene (i.e. mutant alleles), as
well as genes that are not expressed at all by other members
of the cell line population (since a mutation in one gene may
affect the expression of other genes). These effects can
result in the presence of a wider spectrum of transcripts
than might initially be expected from a homogeneous cell
population. Awareness of these effects allows for a measure
of control by, e.g., careful attention to growth conditions
and cell passage number.

These effects are exacerbated when functionally 
different cell lines are concurrently utilized. The original 
allelic forms and distribution of genes in the genome of a
first cell line will be different from that found in a second 
cell line, but neither cell line will be subject to enforced 
internal genomic homogeneity. Furthermore, since the two 
cell lines are functionally different, the spectrum of 
expressed transcripts in one population will be different 
from that present in the other population. 

On the other hand, tissue samples comprising many cell 
types present very high internal consanguinity but very high 
functional diversity. In tissue samples, unlike cell lines, 
genomic homogeneity is maintained by the immune system of the 
individual since most aberrant cells are immediately 
eradicated. This enforcement of genomic homogeneity by the 
immune system works to reduce the spectrum of transcripts 
found in tissues. However, the wide variety of cell types 
within a given tissue generally more than makes up for this 
effect. For example, different cell types often express 
different isoforms of a gene family represented by multiple 
gene copies in the genome (a phenomenon known as 
differentiation-specific expression). The net result is the
presence of an increasing spectrum of different transcripts
expressed in tissues as the number of cell types increases.
The final expression complexity level is therefore much 
higher in tissues than in cell lines. 

5.1.1. GENETIC HETEROGENEITY IN CELL LINES

When starting with cell line samples, the genes-of- 
interest to be identified may already be well defined in 
terms of their precise molecular function (for an example, 
see Section 6 hereinbelow). The sources of genetic
heterogeneity in cell lines are quite different than in
tissues. First, there is heterogeneity associated with the
genetic differences internal to each cell line. Second,
there is heterogeneity associated with the functional
characteristics of each cell line. Third, there is
heterogeneity associated with the genetic differences between
cell lines.

It is the solution (i.e. removal) of the internal
sources of heterogeneity in the first step of the VGID℠
method together with the complete retention and utilization
of the other two sources of heterogeneity in the second step
which leads to the direct isolation of the target genes of
interest under the VGID℠ approach outlined in FIG. 1. That
is, by first retaining only transcripts structurally
identical within each cell line, one removes internal
heterogeneity. By next removing all transcripts identical
between the two cell lines, one is left only with transcripts
specific to the key functions associated with the cell line
expressing the phenotype-of-interest. The choice of
appropriate cell lines is therefore crucial.

The practical aspects of unknown gene isolation from
cell line samples are thus entirely defined, as described in
detail below in Section 5.2.1. The first step in the
approach used for cell lines separately isolates from each
cell line nucleic acids (e.g. transcripts) that are
structurally identical internally. The second step uses the
nucleic acids (e.g. transcripts) from the unspecialized cell
line (i.e. "without phenotype" in FIG. 1) to subtract their
homologues (i.e. structurally identical externally) from the
specialized cell line (i.e. "with phenotype" of interest, see
FIG. 1). The second step utilizes MutS, together with
another trapping system (e.g. streptavidin-coated beads, see
below), to recognize only material originating from the
unspecialized cell line (i.e. hybrid as well as native
duplexes). The material remaining at the end of the
operation corresponds to those few nucleic acids (i.e.
transcripts) which are entirely specific (i.e.
differentiation-specific) to the specialized cell line.

5.1.2. GENETIC HETEROGENEITY IN TISSUES

When starting with tissue samples, unlike with cell
lines, the genes of interest will usually be defined only in
terms of their phenotypic effects (i.e. presence or absence
of a disease or trait). Furthermore, there is no complete
assurance that, in genetically different individuals, the
same phenotypic trait does not have entirely different
causes. To further complicate matters, the material utilized
(e.g. mRNA) according to the second approach of the VGID℠
method comes from a complex source (as explained in detail
above) in that: (a) tissues are made of different cell types
that cannot be separated; and (b) tissue samples are provided
by different individuals.

For tissue samples, three sources of genetic 
heterogeneity exist to contend with in the isolation of the 
genes of interest, including disease (affected) genes. 
First, there is heterogeneity associated with a target tissue 
comprised of multiple cell types. Second, there is 
heterogeneity associated with phenotypic differences among 
normal and affected individuals which do not give rise to 
disease. Third, there is heterogeneity associated with the
genetic differences among normal and affected individuals
which give rise to the disease.

It is the solution (i.e. removal) of the first and
second sources of heterogeneity which directly leads to
isolation of disease genes using the VGID℠ approach outlined
in FIG. 2. By selecting as tissue donors several affected
and several healthy members of the same genetic group (i.e. 
consanguineous donors) and then pooling the tissue extracts 
into only two groups, three things are accomplished. First, 
genetic differences between affected and unaffected 
individuals are considerably reduced; second, phenotypic 
homogeneity among the affected individuals is vastly 
increased; and third, genetic heterogeneities within each 
sample group are homogenized. 

The practical aspects of unknown gene isolation from 
tissue samples are thus entirely defined, as described in
detail below in Section 5.2.2. The first step utilizes
mismatch binding chromatography to isolate transcripts which
are structurally identical among affected individuals (i.e.
column flow through, see upper column in FIG. 2). These
structurally identical transcripts are then used to isolate
their structurally different counterparts from the unaffected
pool in a second round of mismatch binding chromatography
(see lower column in FIG. 2). In this way, none of the
transcripts structurally identical between affected and 
unaffected pools will be trapped by mismatch binding and none 
of the transcripts structurally different within the 
unaffected (healthy) pool will be selectively recovered from 
the material released from binding. 
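
The two steps described above for tissue samples can be sketched, purely for illustration, as a toy Python model. The representation is an assumption and not part of the specification: each affected individual is a dictionary mapping a gene name to one transcript sequence, and the unaffected pool is a dictionary mapping a gene name to the set of sequences seen in that pool. The function names `shared_among_affected` and `differing_counterparts` are hypothetical.

```python
def shared_among_affected(affected):
    """First step, approximated: keep transcripts that are identical in
    every affected individual (the column flow through in this toy model).

    `affected` is a list of dicts, one per individual, gene -> sequence.
    """
    if not affected:
        return {}
    common_genes = set.intersection(*(set(ind) for ind in affected))
    shared = {}
    for gene in common_genes:
        sequences = {ind[gene] for ind in affected}
        if len(sequences) == 1:  # structurally identical among affected
            shared[gene] = sequences.pop()
    return shared


def differing_counterparts(shared, unaffected_pool):
    """Second step, approximated: for each shared affected transcript,
    list the non-identical counterparts found in the unaffected pool
    (the pairs that would form trapped heteroduplexes in this toy model).
    """
    hits = {}
    for gene, seq in shared.items():
        others = sorted(s for s in unaffected_pool.get(gene, ()) if s != seq)
        if others:
            hits[gene] = (seq, others)
    return hits


affected = [
    {"g1": "AAT", "g2": "CCG"},
    {"g1": "AAT", "g2": "CCA"},
]
unaffected_pool = {"g1": {"AAC", "AAT"}, "g2": {"CCG"}}

shared = shared_among_affected(affected)
hits = differing_counterparts(shared, unaffected_pool)
```

Here `g2` drops out in the first step because the affected individuals disagree, and `g1` is reported in the second step because the unaffected pool carries a non-identical counterpart.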



5.2. TWO APPROACHES FOR THE VGID℠ METHOD

The VGID℠ method is designed to identify genes by
isolating nucleic acids derived from transcripts that are
associated with a given phenotype in the complete absence of
pertinent molecular information. In this context, a
phenotype corresponds to a detectable biological difference
between otherwise-comparable tissues or cell population
samples. Biological differences may range from narrow,
well-defined metabolic functions (e.g. DNA repair) to broad,
less-well-defined clinical observations (e.g. schizophrenia or
Alzheimer's disease). As opposed to other expressed
transcript isolation methods (e.g. cDNA subtraction
technologies), the VGID℠ process does not require subtraction
steps based upon known sequences. Moreover, the VGID℠
process does not require any molecular choices to be made by
the user. Instead, the VGID℠ user need only select the input
phenotypes for comparison.

The operating principle of the VGID℠ process makes use
of the fact that any detectable biological difference
existing between two or more otherwise-similar samples almost
always depends, at least in part, on the presence of
concomitant transcriptional differences between these
samples. In order to isolate transcripts associated with a
phenotype-of-interest using the VGID℠ method, one does not
speculate regarding possible structures that need to be
isolated or discarded. Instead, one merely chooses the input
phenotypes for use in the VGID℠ comparison assay.

While the VGID℠ method does not allow one to directly
identify (i.e. by mismatch binding) promoter-associated
mutations contained in non-transcribed portions of genes, any
transcripts that are over- or under-expressed as a result of
such mutations can be identified (e.g. see the first approach
described in Section 5.3.1 and the Example in Section 6
hereinbelow). In summary, the VGID℠ process allows isolation
and identification of overexpressed, underexpressed or 
mutated transcripts that specifically differ between two (or 
more) transcript source populations. 

The VGID℠ method may be applied to any two or more
nucleic acid source populations. Nucleic acid source
populations used in the VGID℠ method are derived from
transcript sources (i.e. messenger RNA from cellular sources)
preferably by converting the mRNA to double-stranded cDNA.
Transcript sources include, but are not limited to, animals,
plants, and microorganisms, including viruses. For example,
the VGID℠ method may be applied for the isolation of
microbial genes conferring resistance to toxic compounds or
metabolites. As another example, the VGID℠ method may be
applied to isolate plant genes conferring desirable traits
for crop production. For tissues and cell lines, transcript
sources may include, but are not limited to: (a) tissue nodes
within an individual tissue sample (first approach); (b) cell
line samples (first approach); and (c) tissue samples
originating from familial clusters having consanguinity
(second approach). The first and second VGID℠ approaches
that can be most commonly used for these various transcript
sources are described in detail below.

5.2.1. FIRST APPROACH: CELL LINES OR SOLE TISSUE SAMPLE

This approach is particularly well suited to the study
of genes associated with specific metabolic functions (e.g.
in a cell line displaying the phenotype-of-interest) or with
disease processes where affected tissue samples are limited
and where control tissue from healthy individuals cannot be
obtained. This approach also allows comparative study of
sporadic versus familial forms of a given pathology.

Three different yet complementary transcript isolations
may be performed using the first VGID℠ approach, as follows:
(i) isolation of transcripts overexpressed (or unilaterally 
expressed) in the presence of the phenotype-of-interest; (ii) 
isolation of transcripts underexpressed (or unilaterally 
repressed) in the presence of the phenotype-of-interest; and 
(iii) isolation of transcript variants (i.e. mutants) 
associated with the phenotype-of-interest. 

The overall experimental scheme for using the VGID℠
method under the first approach is illustrated in FIG. 1.
The first approach identifies a gene or genes underlying a
defined phenotype in two steps by, first, removing mismatched
duplex nucleic acid molecules formed from hybridization
within each of two source populations and, second, retaining
mismatched duplex nucleic acid molecules formed from
hybridization between the two populations. What follows is a
preferred embodiment of the first approach; various
modifications that can be made will be apparent to one of
skill in the art (e.g. see Section 5.3 hereinbelow).

Selection of input phenotypes is performed by the user,
and can be carried out as desired. Nevertheless, preferred
guidelines for phenotype selection (choosing transcript
sources) are provided hereinbelow in Section 5.4. Following
phenotype selection, an independent (i.e. separate) cDNA
library is generated for each of two or more transcript
sources which differ in the phenotype-of-interest. Cell
lines may be used as transcript sources. Alternatively, a
single tissue sample from an affected (i.e. disease)
individual may also be used. In the latter instance,
different cellular nodes are isolated from within the single
tissue sample, each node representing a different
pathological stage or phenotypic state.

In one embodiment, samples from transcript sources are
processed in pairs, each member of a pair representing a
different phenotype. In another embodiment, samples are
processed in groups of three or more (a.k.a. multiplex VGID℠
methodology). The two steps of the method can be further
subdivided into several parts for clarity of illustration.
For example, in the preferred embodiment described below, the
first step comprises parts 1-3 and the second step comprises
parts 4-6, as follows.

Part 1. Each cDNA library originating from each
independent source (e.g. cell line or tissue
node) is subjected to a limited PCR
amplification (15-20 cycles) in order to
linearize the cDNA inserts.



Part 2. The PCR products obtained from each source are
independently (i.e. without yet combining
materials from different sources) denatured
and reannealed.



Following parts 1 and 2, transcripts that present 
structural differences within each source population will 
give rise to mismatched heteroduplex molecules. The 
heterologous transcripts in these heteroduplex molecules 
arise from random mutations not associated with the 
phenotype-of-interest. This heteroduplex formation occurs
since the random mutations encountered should be common to 
only a portion of, and not to all, individual cells within 
each source population. 

Part 3. The reannealed PCR products originating from
each source are exposed independently to a
first round of mismatch column chromatography
(see the top two columns in FIG. 1; e.g.
columns may be packed with MutS-coated glass
beads for "automatic trapping" of the
mismatch-containing heteroduplexes).

In part 3, mismatched heteroduplexes become trapped in
the column. After several cycles of denaturation and random
reannealing followed by trapping, the column flow through
contains primarily transcripts that are structurally common
to all cells within the source. That is, any heterologous
transcripts within the source are largely removed from the
material being analyzed during part 3 (see upper waste bin in
FIG. 1 labeled "removal of heterologous transcripts").
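
The effect of repeating part 3 over several cycles can be illustrated with a simple calculation. The model below is an assumption introduced for illustration and is not taken from the specification: a single variant form of a transcript is present at strand fraction f alongside one common form, strands pair at random on reannealing, every variant/common heteroduplex is trapped, and every homoduplex flows through. Under these assumptions the variant fraction in the flow through after one cycle is f^2 / (f^2 + (1 - f)^2). The function names are hypothetical.

```python
def variant_fraction_after_cycle(f):
    """Variant strand fraction in the flow through after one idealized
    cycle of denaturation, random reannealing and mismatch trapping.

    Random pairing gives variant homoduplexes with probability f*f and
    common homoduplexes with probability (1-f)*(1-f); the heteroduplex
    share 2*f*(1-f) is assumed to be trapped completely.
    """
    homo_variant = f * f
    homo_common = (1.0 - f) * (1.0 - f)
    return homo_variant / (homo_variant + homo_common)


def deplete(f, cycles):
    """Return the variant fraction before cycling and after each cycle."""
    history = [f]
    for _ in range(cycles):
        f = variant_fraction_after_cycle(f)
        history.append(f)
    return history


history = deplete(0.2, 2)
```

Starting from f = 0.2, this idealized model gives 1/17 after one cycle and 1/257 after two, which is consistent with the statement that the flow through comes to contain primarily the common transcripts. A variant present at exactly f = 0.5 is not depleted in this model, so the sketch applies only to minority variants.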

Part 4. The cDNA inserts present in the flow-through
obtained from each cell line are independently
PCR amplified and labeled.

In part 4, PCR amplification serves two purposes.
First, this PCR increases the number of copies of the
remaining individual cDNA inserts which originated from each
source population. Second, and more importantly, PCR allows
independent labeling of inserts originating from each source
population. In this way, one is able to selectively remove
or retrieve inserts originating from a given source
population. For example, FIG. 1 illustrates using two cell
lines as source populations, with cell line # 1 displaying
the phenotype-of-interest ("with phenotype" in FIG. 1).
Here, inserts that will not be part of the final analysis are
labeled for removal (i.e. transcripts not associated with the
phenomenon of interest; see lower right waste bin "removal of
all molecules with DNA strand from cell line # 2"; see also
hereinbelow).

The labels used in part 4 are attached to the primers 
utilized in the relevant PCR reaction. Suitable labels 
include molecules that can be specifically bound and
subsequently removed from solution together with their
attached PCR products. For example, such labels may be: (a)
biotin molecules recognized by streptavidin coated onto solid 
supports; or (b) short peptides recognized by specific 
monoclonal antibodies attached to solid supports. The solid 
supports used may be beads, resins, nitrocellulose paper, or
others well known to those skilled in the art. 

Part 5. The PCR amplified DNA, obtained from
independent sources and subjected to parts 1-4,
is now combined, denatured and reannealed.

Part 6. The reannealed PCR products are then exposed
to a second round of mismatch column
chromatography (see the lower column in FIG. 1).

In the FIG. 1 approach, the material trapped in the 
lower column is primarily mismatched heteroduplexes composed 
of one non-labeled strand originating from cell line # 1 and 
one labeled strand originating from cell line # 2. This 
material therefore represents transcripts expressed by both 
cell lines but carrying cell line-specific mutations, and may
either be discarded (see lower right waste bin in FIG. 1) or
recovered, cloned, and analyzed. It is to be well noted that 
if parts 2 and 3 have not first been carried out, any 
material trapped by the source combination in parts 5 and 6 
would not be worth recovering since it would be heavily 
contaminated by random heterologies present in each source
cell line.

Recovery of trapped heteroduplexes from a MutS mismatch 
binding column can be performed in at least two ways. First, 
the column may be filled with an ATP-containing buffer. The 
presence of ATP allows the ATPase activity of MutS to release 
trapped heteroduplexes. The concentration range of ATP 
suitable for effecting release is from about 1 mM to about 6 
mM ATP; the optimal concentration of ATP for effecting
release is about 3 mM (see e.g. Allen et al., 1997, EMBO J.
16, 4467-4476). Recovery of trapped heteroduplexes using ATP
has the added advantage of regenerating the column for
subsequent use. Second, recovery may be effected using a
protease (with the caveat that certain proteases may not be
suitable for use with certain short peptide labels). For
example, the column may be treated with a protease-containing
buffer (e.g. proteinase K), resulting in the destruction of
the MutS protein molecules immobilized in the column and the
subsequent release of the trapped heteroduplexes.

Trapped material from the lower column in the example of
FIG. 1 is composed of one labeled and one non-labeled strand.
This material may be discarded if one is only interested in
transcripts from cell line # 1 (see lower right waste bin in
FIG. 1). Alternatively, this material may be specifically
recovered (e.g. using streptavidin or antibody-coated beads,
depending upon the label used at part 4 above), for an
examination of the genetic differences in transcripts
expressed by both input cell lines. If this specific
recovery is desired, the isolated material is PCR amplified
over a few cycles for production of clonable fragments having
non-labeled 5' ends. It is noteworthy that recovery here
preserves the original structures specific to each cell line
since, in the PCR reaction, each strand of the original
mismatched heteroduplex independently gives rise to a
perfectly matched homoduplex. It is also possible to
separately clone the transcripts arising from each cell line
source. This is accomplished by denaturing the
heteroduplexes released from the column and subsequently
bound to the label-binder (e.g. streptavidin beads),
separating the pellet (containing labeled strands) from the
supernatant (containing unlabeled strands), and performing
two PCR reactions using material in the pellet and the
supernatant as separate templates.

The material untrapped by the lower column in the 
schematic of FIG. 1 (i.e. column flow-through) potentially 
contains three types of mismatch-free duplex DNA molecules,
as follows. First, it can contain unlabeled homoduplexes
which primarily represent transcripts that are overexpressed
or unilaterally expressed by the "non-labeled" cell line
(i.e. transcripts that have no, or very few, counterparts in
the "labeled" cell line). Second, it can contain
homoduplexes labeled on both strands which primarily
represent transcripts that are overexpressed or unilaterally
expressed by the "labeled" cell line (i.e. transcripts that
have no, or very few, counterparts in the "non-labeled" cell
line). Third, it can contain hybrid homoduplexes labeled on
one strand only which represent transcripts common to both 
cell lines expressed at comparable levels. 
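
The sorting of duplexes at the lower column of FIG. 1 can be summarized in a short sketch. The Python fragment below is illustrative only: the function name `classify_duplex` and the returned strings are hypothetical and not part of the method. It assumes the one-label scheme of FIG. 1, in which only strands from the labeled cell line carry the part 4 label, and it treats trapping as ideal (every mismatched duplex is retained).

```python
def classify_duplex(labeled_strands, has_mismatch):
    """Return (column fraction, interpretation) for one reannealed duplex.

    labeled_strands: number of strands (0, 1 or 2) carrying the part 4 label,
                     i.e. strands originating from the labeled cell line.
    has_mismatch:    True if the duplex contains a mispaired or unpaired base.
    """
    if labeled_strands not in (0, 1, 2):
        raise ValueError("a duplex has exactly two strands")
    if has_mismatch:
        # Idealized assumption: every mismatched duplex is retained by MutS.
        return ("trapped", "mismatched heteroduplex")
    if labeled_strands == 0:
        return ("flow-through", "specific to or overexpressed in the non-labeled cell line")
    if labeled_strands == 2:
        return ("flow-through", "specific to or overexpressed in the labeled cell line")
    return ("flow-through", "common to both cell lines")
```

For example, `classify_duplex(0, False)` describes the unlabeled homoduplexes that remain in solution once all labeled material has been removed with a label binder.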

As in the case of the mismatched material, singly- 
labeled homoduplex hybrids as well as doubly labeled 
homoduplexes can be specifically removed from solution, 
leaving behind transcripts originating from the non-labeled 
cell line that have no counterparts in the labeled cell line. 
It should be noted that transcripts specific to the labeled
cell line (i.e. doubly-labeled homoduplexes) cannot be
isolated from transcripts common to both cell lines (i.e.
singly-labeled homoduplex hybrids) under the scheme
illustrated in FIG. 1. In order to isolate these
transcripts, the labeling strategy is reversed and the
experiment repeated. Alternatively, a different labeling
strategy altogether (i.e. two-label strategy) may be employed
in which transcripts originating from cell line #1 are not
left unlabeled. Here, the "labeling" step diagrammed in
FIG. 1 is performed on both upper column flow throughs, using
a distinct label for each column.

Thus, in the single experiment outlined above,
transcripts specific to one cell line (or tissue node) can be
isolated from transcripts that bear cell line-specific (or
node-specific) mutations. It is to be further noted that, by
using two or more different labeling agents (e.g. biotin and
one or more short peptides), the approach can be multiplexed.
That is, using multiple labels, several different cell lines
or tissue nodes can be analyzed concurrently and the
transcripts specific to each component individually isolated.
Multiplexing is limited only by the number of available
labels and the user's imagination in choosing input
phenotypes.

5.2.2. SECOND APPROACH: SAMPLES FROM
ORGANISMS HAVING CONSANGUINITY

The second approach for using the VGID(SM) method is
particularly appropriate for the isolation of founder effect
mutations from population samples having consanguinity, i.e.
at least one recent common ancestor. What follows are
preferred embodiments of the second approach; various
modifications that can be made will be apparent to one of
skill in the art (e.g. see Section 5.3 hereinbelow). In one
preferred embodiment, individuals of a population have a 
common parent or grandparent. In another embodiment, 
individuals of a population share a common ancestor within 
three generations (i.e. great grandparent). In still another 
embodiment, individuals of a population share a common 
ancestor within ten generations. Here, obtaining control 
tissue samples from healthy relatives is an absolute 
requirement. While the overall procedure, which is 
illustrated in FIG. 2, is similar to the first approach in
that it utilizes mismatch binding, there are important
differences under the second approach, as described below.

First, as just mentioned, affected and healthy 
individuals contributing tissue samples must all share at 
least one recent common ancestor (i.e. consanguineous
individuals). Of course, the population labels "affected"
(or "diseased") and "healthy" (or "control") are arbitrary in
that two input populations differing in a phenotype-of-
interest (and not necessarily a disease) are all that is
required, so long as all individuals contributing to the 
input populations have consanguinity. 

Second, at least two (2) affected and two (2) healthy
relatives should be sampled for optimum results. For best
results, samples should be collected from five to six
diseased individuals and an equal number of healthy
individuals. Individuals need not all come from the same
nuclear family (defined herein as having a common mother or
father) and they need not be age-matched.

Third, the cDNA libraries constructed from each tissue
sample should be normalized in order to reduce the chances of
missing rare transcripts. Library normalization techniques
include any of those known to one skilled in the art, such as
those described in Section 5.3 hereinbelow.

Fourth, a homogenization step is performed on the
samples obtained from affected individuals. Homogenization
is carried out as follows: A sample is obtained from each
affected individual and is used to construct an independent
(i.e. separate) cDNA library; each library is then PCR
amplified and the resultant products from all affected
individuals are mixed together. Denaturation, reannealing
and trapping of mismatched duplexes over an immobilized MutS
column is performed (see upper column in FIG. 2). Although
this homogenization step will result in a 50% reduction in
frequency of heterozygous mutant transcripts in the flow-
through material, the step is preferable to ensure the
isolation of transcripts structurally common to all affected
individuals (see upper column flow through in FIG. 2). The
material recovered in the upper column flow through is then 
PCR-labeled as described above. 
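
One way to read the 50% figure quoted for the homogenization step is as a consequence of random strand pairing. The Python sketch below is a simplified model offered as an assumption, not taken from the method itself: it supposes that every affected individual contributes equally to the pool, that both alleles are expressed equally, that the mutant and wild-type transcripts differ at a single position, and that trapping is ideal. The function name `surviving_mutant_fraction` and the allele codes "M" and "W" are hypothetical.

```python
def surviving_mutant_fraction(genotypes):
    """Fraction of mutant strands expected in the flow-through.

    genotypes: list of (allele, allele) pairs, one per affected individual,
               using "M" for the mutant allele and "W" for the wild type.
    """
    alleles = [allele for pair in genotypes for allele in pair]
    if not alleles:
        raise ValueError("at least one genotype is required")
    # Frequency of the mutant sequence in the pooled, denatured material.
    mutant_frequency = alleles.count("M") / len(alleles)
    # Under random reannealing, a mutant strand finds a mutant complement
    # (forming a mismatch-free duplex that passes the column) with a
    # probability equal to the mutant frequency in the pool.
    return mutant_frequency
```

Under these assumptions a pool built only from heterozygous carriers, such as `[("M", "W")] * 5`, gives a surviving fraction of 0.5, whereas homozygous carriers would lose nothing.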

Fifth, a homogenization step like that performed on 
affected samples is not applied to healthy (i.e. control) 
samples. Instead, PCR material obtained from each control
cDNA library is mixed together and then directly added to the
affected, PCR-labeled products obtained from the upper column
flow through illustrated in FIG. 2. This complex mixture is 
then denatured and randomly reannealed before exposure to the 
lower MutS column illustrated in FIG. 2. The major reason 
behind this step is to provide an efficient counterbalance to 
the effects of consanguinity. The more closely related
affected individuals are, the greater the number of
structurally identical loci they hold in common. As a
result, the pool of transcripts remaining after the
homogenization step may be quite large, but only a few of 
these are likely to be relevant to the disorder. 
Furthermore, familial genetic disorders are often associated 
with specific mutations that are frequent among affected 
members of the disease-transmitting family. However, this
does not mean that unaffected individuals are mutation-free 
at the loci concerned. It simply means that unaffected 
individuals have inherited polymorphisms other than those 
associated with the disease, and there could be many such
silent polymorphisms. 

The net result of the above considerations is that while 
affected individuals within a disease-transmitting family are 
very likely to share the same mutations, healthy members of 
the family do not necessarily have the same "healthy" 
alleles. Therefore, in order to identify the mutant loci 
associated with a familial disorder (together with healthy
allelic forms), it is highly advisable to first isolate
transcripts structurally common to all affected individuals
(i.e. to reduce the complexity by homogenization). At the 
same time, it is highly advisable to maintain as much 
diversity as possible within control samples in order to 
maximize chances of isolating all healthy allelic variants. 

Accordingly, in approach 2, mismatched heteroduplexes
that are trapped by the second column (i.e. lower column in
FIG. 2) have potentially two sources: (a) unlabeled
heteroduplexes with both strands originating from healthy
individuals; and (b) hybrid heteroduplexes labeled on one
strand originating from affected individuals and representing
transcripts structurally common to all affected individuals
which are also present in their healthy relatives with a
sequence difference. Thus, any mutant alleles associated
with disease, as well as their "healthy" counterparts, will
be found in the trapped material. Following release of the
trapped material with either ATP or proteinase K (as
described above) the labeled strand can be specifically
removed from solution.

The flow through material of the second (lower) column
in FIG. 2 potentially contains: (a) singly-labeled mismatch-
free duplexes representing transcripts structurally common to
affected and unaffected relatives; (b) doubly-labeled
mismatch-free duplexes representing transcripts structurally
common to affected relatives only; and (c) unlabeled
mismatch-free duplexes representing transcripts present in
unaffected relatives only. The latter can be specifically
recovered by removing from solution all labeled, mismatch-
free duplexes using a label binder (e.g. streptavidin-coated
beads).

It is to be noted that transcripts specific to affected
individuals only (i.e. doubly labeled mismatch-free duplexes)
cannot be directly recovered from the lower (second) column
of FIG. 2 using this approach. To isolate such transcripts,
it would be necessary to reverse the labeling strategy and
repeat the experiment (i.e. label during the lower left "PCR"
diagrammed in FIG. 2). When labeling only healthy
individuals, however, mutations associated with disease
cannot be isolated from mismatched heteroduplexes trapped in
the lower MutS column (singly-labeled mismatched hybrids);
further, the vast majority of trapped material will originate
from the healthy individuals alone due to the absence of
selective recovery of transcripts structurally common to all
healthy individuals (i.e. homogenization). Furthermore, a
selective recovery step (i.e. a parallel first column) to
homogenize the nucleic acid population cannot be carried out
on healthy relatives without a serious risk of losing the
relevant alleles through the presence of silent polymorphisms
which will generate numerous mismatched heteroduplexes at the
denaturation-reannealing step and that would remain trapped
(upper waste bin of FIG. 2) in the first round of MutS
chromatography. Alternatively of course, as described for
the first approach in Section 5.3.1. above, more than one
label may be used.

It should be well noted that the higher the inbreeding
levels in the families contributing normal and disease
samples, the fewer the number of mismatched loci ultimately
obtained. Although all mismatched loci identified in this
way will serve as markers to differentiate healthy from
diseased individuals, it should also be noted that silent
genetic polymorphisms (i.e. harmless, non-disease-associated
changes in DNA) will be identified as well. Accordingly,
best results in identifying disease genes will be obtained
using highly inbred populations since inbreeding reduces the
number of silent genetic polymorphisms between input sources
to a minimum.

The genetic loci identified by the above procedure can
be used as probes in population studies carried out by the
standard immobilized MutS genotyping approach on genomic DNA
obtained from affected individuals and healthy individuals
(see Wagner et al., 1995, Nucl. Acids Res. 23, 3944-3948).
Subsequent statistical analysis, well known to those skilled
in the art, will then easily identify the loci and the
alleles associated with susceptibility and resistance to the
disease.

In summary, the second VGID(SM) approach provides numerous
advantages in the search for disease-causing genes from
consanguineous sample populations. First, the approach turns
highly inbred populations into an asset as opposed to a
liability. Second, the approach allows rapid gene
identification in cases where the lack of physiological
and/or biochemical information is such that there is no basis
on which possible candidate genes could be proposed. Third,
the approach allows rapid identification of genes and all
alleles directly and indirectly associated with
susceptibility and resistance to a disease. Fourth, the
approach can be applied to any consanguineous population, in
many contexts, ranging from the search for susceptibility or
resistance genes associated with multifactorial diseases, to
the search for rare genes conferring desirable monogenic
traits.

The number of clones sequenced from the output obtained
under either approach to the VGID(SM) method is as desired by
the user. Optionally, one or more (e.g. five or six) clones
among those initially identified are sequenced to sample the
results. Nucleic acid sequences are then computer analyzed
for open reading frames, and used to drive a protein database
search to determine whether any portions correspond to
portions of known proteins. It is preferable to perform such
a search by translating the nucleotide sequence into all six
possible reading frames (3 in each direction) in order to
detect any proteins existing in the database. Of course, the
VGID(SM) method will also identify genes not yet represented in
any database. In this instance, gene function may be
inferred from the observed functional differences between the 
input phenotypes. 
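
The six-frame translation mentioned above can be sketched as follows. This Python fragment is a minimal illustration and not the software used by the inventors: it assumes the standard genetic code, an input restricted to the letters A, C, G and T (any other codon is reported as "X"), and the hypothetical function names `reverse_complement`, `translate` and `six_frame_translations`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" is a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons give 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(seq[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames
```

Each of the six protein strings can then be submitted to a protein database search; which search tool is used is left to the user.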

5.3. MISCELLANEOUS METHODS USED IN CONJUNCTION WITH
THE VGID(SM) METHOD

Nucleic acid (e.g. mRNA) extraction and cDNA synthesis
are performed using techniques well known to those skilled in
the art. For example, Gibco-BRL Trizol kits may be used for
mRNA preparation and Promega Universal Riboclone kits may be
used for cDNA synthesis, both according to the manufacturers'
protocols. The synthesized cDNA may be size-selected by any
of the techniques well known to those skilled in the art.
For example, agarose gel electrophoresis, sucrose density
gradient chromatography, molecular sieve chromatography or
high performance liquid chromatography may be used. The cDNA
fragments subsequently cloned may range from below 100 bases
up to 10 kilobases or more. However, it should be recognized
that the optimum size for error-free PCR is about 600 bases.
It should further be recognized that the optimum size for
error-free reverse transcription is about 400 bases. A
suitable viral reverse transcriptase is that obtained from
Moloney murine leukemia virus (MMLV). If cDNA is
fractionated by agarose gel electrophoresis, it may be
recovered from gel slices using a variety of techniques well
known in the art. For example, fragments may be collected by
overnight diffusion into a small liquid volume, or by using
one of many commercially-available kits, such as Gel-Clean
(Promega) or QiaQuick (Qiagen).

There are no special considerations when choosing a
vector for cDNA library construction. The VGID(SM) method will
work independently of the specific library vector employed.
Often, the best vector will be the one which the user is most
familiar with. Of course, the most important consideration
for best results will be to ensure that the libraries
constructed represent rare as well as abundant transcripts,
e.g. by normalizing the libraries.

Library inserts are PCR amplified using oligonucleotide
primers (oligonucleotides) specific to the cloning vector.
Labeled oligonucleotides are used as suitable for the
particular experimental design being used. For example, in
the first VGID(SM) approach (see FIG. 1), the oligonucleotides
used for the cell line #2 library are labeled with biotin
(see Example in Section 6). Any heat stable polymerase may
be used, but those with the lowest error rate available are
preferred to reduce the number of mismatches created during
the PCR. Examples of suitable enzymes are Taq DNA polymerase
and Pfu DNA polymerase. It is important to remember that
large numbers of cycles are not required since the goal is
simply to produce linearized (and, where needed, labeled)
fragments from the library. The PCR products are column
purified, heat-denatured, annealed, and cooled to room
temperature.

Subtraction of heteroduplex DNA is performed on
renatured, cooled PCR products using mismatch-binding
chromatography. This may be conveniently performed in a
variety of formats including a test tube format, a column
format, or any other format selected by the user which
permits heteroduplex DNA to bind to immobilized mismatch
binding protein. For example, DNA in a reaction buffer (e.g.
350 ng in 100 µl) may be placed into a vessel (e.g. 0.5 ml
Eppendorf tube) containing MutS (e.g. 10 µg) adsorbed onto
glass beads (e.g. 100 µm diameter, acid washed, from Sigma
Chemical Co.). The incubation phase is performed for a time
sufficient to allow mismatch binding to occur (e.g. 15-55
min). The incubation time may vary according to measures
taken to increase the contact surface area between the
immobilized mismatch binding protein and the reannealed cDNA.
Such measures may include slowly rotating the vessel or
placing the vessel in a horizontal position. The unbound,
reannealed PCR products left free in solution may be
recovered as column flow through (in a column format) or as
supernatant following centrifugation (in a test tube format).

It is often advantageous to repeat the mismatch binding 
protein mediated trapping operation using fresh immobilized 
protein a total of two to four times to ensure removal of all
mismatched heteroduplexes. The optimum number of repetitions
required will depend primarily on the relative amounts of
mismatched heteroduplexes to be trapped and the quantity of
protein available for trapping in each round. 
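
The dependence of the number of repetitions on the heteroduplex load and on the protein available per round can be illustrated with a simple capacity estimate. The Python sketch below is an assumption-laden illustration: it supposes that each batch of fresh immobilized protein removes a fixed mass of mismatched heteroduplex, and both the function name `trapping_rounds` and the example capacity are hypothetical values, not figures from the text.

```python
import math


def trapping_rounds(heteroduplex_ng, capacity_ng_per_round):
    """Estimate how many rounds of fresh immobilized protein are needed.

    heteroduplex_ng:       mass of mismatched heteroduplex to be removed.
    capacity_ng_per_round: mass one batch of immobilized protein can trap
                           (hypothetical; to be determined empirically).
    """
    if heteroduplex_ng < 0 or capacity_ng_per_round <= 0:
        raise ValueError("load must be >= 0 and capacity must be > 0")
    # Each round removes at most one batch capacity, so round up.
    return math.ceil(heteroduplex_ng / capacity_ng_per_round)
```

With, say, 100 ng of heteroduplex and an assumed capacity of 40 ng per round, the estimate is three rounds, which falls within the two to four repetitions suggested above.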

It will be advantageous, in some instances, to use an 
excess of DNA from one source over the other when performing 
a subtraction. For example, prior to performing the second 
round of trapping using the approach illustrated in FIG. 1, 
an excess of DNA from the source without the phenotype-of-
interest (i.e. cell line # 2) may be used over DNA from the
source with the phenotype-of-interest (i.e. cell line # 1) in
order to ensure the complete removal of all transcripts which
are identical between the two sources. In this regard, the
source without the phenotype-of-interest may be thought of as
a molecular mop for removal of undesired transcripts. The
ratio of excess DNA may vary over a wide range, i.e. from
1.01:1.0 to 100:1. It will often range from 1.1:1.0 to 10:1.
It will most often range from 1.5:1 to 6:1. A recommended 
starting ratio is 3:1. 
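
The effect of the excess ratio can be made concrete with a small calculation. The Python sketch below is an idealized illustration, not part of the method: it assumes that a shared transcript is equally abundant per unit mass in both sources and that strands reanneal at random, and the function names `excess_source_ng` and `self_paired_fraction` are hypothetical.

```python
def excess_source_ng(phenotype_source_ng, ratio):
    """Mass of DNA needed from the source without the phenotype-of-interest."""
    if phenotype_source_ng < 0 or ratio <= 0:
        raise ValueError("mass must be >= 0 and ratio must be > 0")
    return phenotype_source_ng * ratio


def self_paired_fraction(ratio):
    """Fraction of strands from the phenotype source that re-pair with a
    strand from the same source, for a transcript identical in both sources.

    With ratio parts of excess DNA per one part of phenotype-source DNA,
    a random partner comes from the phenotype source with probability
    1 / (1 + ratio).
    """
    if ratio <= 0:
        raise ValueError("ratio must be > 0")
    return 1.0 / (1.0 + ratio)
```

Under these assumptions, the recommended 3:1 starting ratio applied to 350 ng of DNA calls for 1050 ng of the excess source and leaves about one quarter of the shared strands self-paired, compared with one half at 1:1.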

For best results in obtaining cDNAs which represent rare
transcripts encoding the phenotype-of-interest, preparation
of normalized cDNA libraries from each input mRNA source is
performed. A portion of each input library should be
preserved in non-normalized form for further analysis, if
desired. Normalization techniques known in the art include,
but are not limited to, those described in the following:
Soares and Efstratiadis, June 10, 1997, Normalized cDNA
libraries, U.S. Pat. No. 5,637,685; Sankhavaram et al., March
1991, Construction of a uniform-abundance (normalized) cDNA
library, Proc. Natl. Acad. Sci. USA 88, 1943-1947; and Ko,
1990, An equalized cDNA library by the reassociation of short
double-stranded cDNAs, Nucl. Acids Res. 18, 5709.

Suitable mismatch binding proteins that can be used have
been previously described (see e.g. Wagner, 11 May 1995,
Immobilized mismatch binding protein for detection or
purification of mutations or polymorphisms, International
Publication Number WO 95/12689). A preferred mismatch
binding protein is characterized by its ability to bind DNA-
DNA duplexes containing mispaired or unpaired bases (Id. at
13). For example, in addition to E. coli MutS, the mismatch
binding protein may be human MSH2 (Fishel et al., 1994,
Science 266, 1403-1405; Fishel et al., 1994, Cancer Res. 54,
5539-5542; Mello et al., 1996, Chem. Biol. 3, 579-589), an
hMSH2-hMSH6 protein complex (Acharya et al., 1996, Proc.
Natl. Acad. Sci. U.S.A. 93, 13629-13634; Gradia et al., 1997,
Cell 91, 995-1005), or homologues from various other
organisms such as yeast (Miret et al., 1993, J. Biol. Chem.
268, 3507-3513).

Suitable conditions for annealing (i.e. hybridization)
reactions have been well described, for example, by Sambrook
et al., 1989, in Molecular Cloning, A Laboratory Manual, 2d
Edition, Cold Spring Harbor Laboratory Press, Cold Spring 
Harbor, New York. 

Separation of labeled strands from unlabeled strands or
from differently-labeled strands is performed using standard
techniques. For example, biotin-labeled strands bound to
streptavidin-coated beads may be placed into a first
container or vessel for heat denaturation into single
strands. After denaturation, the supernatant is removed from
the first container and transferred to a second container,
resulting in separation of labeled strands from unlabeled or
differently-labeled strands. Each set of strands can now be
independently PCR amplified (for a few cycles), cloned and
sequenced.

Suitable nucleic acid labels and their partner molecules
or agents (i.e. binding partners) include, but are not
limited to, biotin and streptavidin, and short peptide labels
and monoclonal antibodies. These are discussed in Section
5.8 and 5.9 infra. Suitable methods of linearizing inserts
in a cDNA library include, but are not limited to, PCR and
digestion with restriction enzyme(s). Suitable methods of
amplifying cDNA include, but are not limited to, PCR and
propagation in bacteria.

5.3.1. DNA AMPLIFICATION

The polymerase chain reaction (PCR) may be used in
connection with the invention to amplify a desired sequence
from a source (e.g., a tissue sample, a genomic or cDNA
library). Oligonucleotide primers representing known
sequences can be used as primers in PCR. PCR is typically
carried out by use of a thermal cycler (e.g., from Perkin-
Elmer Cetus) and a thermostable polymerase (e.g., Gene Amp(TM)
brand of Taq polymerase). The nucleic acid template to be
amplified may include but is not limited to mRNA, cDNA or
genomic DNA from any species. The PCR amplification method
is well known in the art (see, e.g., U.S. Patent Nos.
4,683,202, 4,683,195 and 4,889,818; Gyllenstein et al., 1988,
Proc. Nat'l. Acad. Sci. U.S.A. 85, 7652-7656; Ochman et al.,
1988, Genetics 120, 621-623; Loh et al., 1989, Science 243,
217-220).

Any prokaryotic cell, eukaryotic cell, or virus, can
serve as the nucleic acid source. For example, nucleic acid
sequences may be obtained from the following sources: human,
porcine, bovine, feline, avian, equine, canine, insect (e.g.,
Drosophila), invertebrate (e.g., C. elegans), plant, etc.
The DNA may be obtained by standard procedures known in the
art (see, e.g., Sambrook et al., 1989, Molecular Cloning, A
Laboratory Manual, 2d Ed., Cold Spring Harbor Laboratory
Press, Cold Spring Harbor, New York; Glover (ed.), 1985, DNA
Cloning: A Practical Approach, IRL Press, Ltd., Oxford, U.K.
Vol. I, II).

5.3.2. ADJUSTING STRINGENCY

Other methods available for use in connection with the
methods of this invention include nucleic acid hybridization
under low, moderate, or high stringency conditions (e.g.,
Northern and Southern blotting). Methods for adjustment of
hybridization stringency are well known in the art (see,
e.g., Sambrook et al., 1989, Molecular Cloning, A Laboratory
Manual, 2d Ed., Cold Spring Harbor Laboratory Press, Cold
Spring Harbor, New York; see, also, Ausubel et al., eds., in
the Current Protocols in Molecular Biology series of
laboratory technique manuals, 1987-1994 Current Protocols,
1994-1997 John Wiley and Sons, Inc.; see, especially, Dyson,
N.J., 1991, Immobilization of nucleic acids and hybridization
analysis, In: Essential Molecular Biology: A Practical
Approach, Vol. 2, T.A. Brown, ed., pp. 111-156, IRL Press at
Oxford University Press, Oxford, U.K.; each of which is
incorporated by reference herein in its entirety). Salt
concentration, melting temperature, the absence or presence 
of denaturants, and the type and length of nucleic acid to be 
hybridized (e.g., DNA, RNA, PNA) are some of the variables 
considered when adjusting the stringency of a particular 
hybridization reaction according to methods known in the art. 
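
How these variables interact can be illustrated with a widely quoted empirical approximation for the melting temperature of long DNA:DNA hybrids. The formula and its coefficients are not taken from this document and vary slightly between references, so the Python sketch below, including the function name `approx_tm_celsius` and its parameter names, should be read as an assumption-laden illustration rather than a prescribed calculation.

```python
import math


def approx_tm_celsius(sodium_molar, gc_percent, formamide_percent, length_bp):
    """Approximate melting temperature (deg C) of a long DNA:DNA hybrid.

    Uses the commonly quoted empirical form
        Tm = 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - 0.61*(%formamide) - 500/L
    which is a rough guide only and is not suited to short oligonucleotides.
    """
    if sodium_molar <= 0 or length_bp <= 0:
        raise ValueError("sodium concentration and length must be positive")
    return (
        81.5
        + 16.6 * math.log10(sodium_molar)
        + 0.41 * gc_percent
        - 0.61 * formamide_percent
        - 500.0 / length_bp
    )
```

In this approximation, lowering the salt concentration or adding formamide lowers the estimated melting temperature, so a given hybridization or wash temperature becomes more stringent.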

Conditions of low stringency, by way of example and not
limitation, may be as follows (see, also, Shilo and Weinberg,
1981, Proc. Natl. Acad. Sci. U.S.A. 78, 6789-6792). Filters
containing DNA are pretreated for 6 h at 40°C in a solution
containing 35% formamide, 5X SSC, 50 mM Tris-HCl (pH 7.5),
5 mM EDTA, 0.1% PVP, 0.1% Ficoll, 1% BSA, and 500 µg/ml
denatured salmon sperm DNA. Hybridizations are carried out
in the same solution with the following modifications: 0.02%
PVP, 0.02% Ficoll, 0.2% BSA, 100 µg/ml salmon sperm DNA, 10%
(wt/vol) dextran sulfate, and 5-20 X 10^6 cpm 32P-labeled probe
is used. Filters are incubated in hybridization mixture for
18-20 h at 40°C, and then washed for 1.5 h at 55°C in a
solution containing 2X SSC, 25 mM Tris-HCl (pH 7.4), 5 mM
EDTA, and 0.1% SDS. The wash solution is replaced with fresh
solution and incubated an additional 1.5 h at 60°C. Filters
are blotted dry and exposed for autoradiography. If
necessary, filters are washed for a third time at 65-68°C and
re-exposed to film.

Conditions of high stringency, by way of example and not 
limitation, may be as follows. Prehybridization of filters 
containing DNA is carried out for 8 h to overnight at 65°C in
buffer composed of 6X SSC, 50 mM Tris-HCl (pH 7.5), 1 mM
EDTA, 0.02% PVP, 0.02% Ficoll, 0.02% BSA, and 500 µg/ml
denatured salmon sperm DNA. Washing of filters is done at
37°C for 1 h in a solution containing 2X SSC, 0.01% PVP,
0.01% Ficoll, and 0.01% BSA. This is followed by a wash in
0.1X SSC at 50°C for 45 min before autoradiography.



5.3.3. OLIGONUCLEOTIDE ANALOGS

Oligonucleotides used in conjunction with the invention
often range from 10 to about 50 nucleotides in length.
In specific aspects, an oligonucleotide is 10 nucleotides, 15
nucleotides, 20 nucleotides or 50 nucleotides in length. An
oligonucleotide can be DNA or RNA or chimeric mixtures or
derivatives or modified versions thereof, or single-stranded
or double-stranded, or partially double-stranded. An
oligonucleotide can be modified at the base moiety, sugar
moiety, or phosphate backbone, or a combination thereof. An
oligonucleotide may include other appending groups, such as
biotin, fluorophores, or peptides.

An oligonucleotide may comprise at least one modified
base moiety which is selected from the group including but
not limited to 5-fluorouracil, 5-bromouracil, 5-chlorouracil,
5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine,
5-(carboxyhydroxylmethyl) uracil, 5-carboxymethylaminomethyl-
2-thiouridine, 5-carboxymethylaminomethyluracil,
dihydrouracil, beta-D-galactosylqueosine, inosine,
N6-isopentenyladenine, 1-methylguanine, 1-methylinosine,
2,2-dimethylguanine, 2-methyladenine, 2-methylguanine,
3-methylcytosine, 5-methylcytosine, N6-adenine,
7-methylguanine, 5-methylaminomethyluracil,
5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine,
5'-methoxycarboxymethyluracil, 5-methoxyuracil,
2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid
(v), pseudouracil, queosine, 2-thiocytosine, 5-methyl-
2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil,
uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid
(v), 5-methyl-2-thiouracil, 3-(3-amino-3-N-2-carboxypropyl)
uracil, and 2,6-diaminopurine.

An oligonucleotide may comprise at least one modified
phosphate backbone selected from the group including but not
limited to a phosphorothioate, a phosphorodithioate, a
phosphoramidothioate, a phosphoramidate, a phosphordiamidate,
a methylphosphonate, an alkyl phosphotriester, and a
formacetal or analog thereof.

An oligonucleotide or derivative thereof used in
conjunction with the methods of this invention may be
synthesized using any method known in the art, e.g., by use
of an automated DNA synthesizer (such as are commercially
available from Biosearch, Applied Biosystems, etc.). As
examples, phosphorothioate oligonucleotides may be
synthesized by the method of Stein et al. (Stein et al.,
1988, Nucl. Acids Res. 16, 3209), methylphosphonate
oligonucleotides can be prepared by use of controlled pore
glass polymer supports (Sarin et al., 1988, Proc. Natl. Acad.
Sci. U.S.A. 85, 7448-7451), etc. An oligonucleotide may be
an α-anomeric oligonucleotide. An α-anomeric oligonucleotide
forms specific double-stranded hybrids with complementary RNA
in which, contrary to the usual β-units, the strands run
parallel to each other (see Gautier et al., 1987, Nucl. Acids
Res. 15, 6625-6641).

Oligonucleotides may be synthesized using any method
known in the art (e.g., standard phosphoramidite chemistry on
an Applied Biosystems 392/394 DNA synthesizer). Further,
reagents for synthesis may be obtained from any one of many
commercial suppliers.

Spacer phosphoramidite molecules may be used during
oligonucleotide synthesis, e.g., to bridge sections of
oligonucleotides where base pairing is undesired or to
position labels or tags away from an oligonucleotide portion
undergoing base pairing. The spacer length can be varied by
consecutive additions of spacer phosphoramidites. Spacer
phosphoramidite molecules may be used as 5'- or 3'-
oligonucleotide modifiers. Such spacers include Spacer
Phosphoramidite 9 (i.e., 9-O-Dimethoxytrityl-
triethyleneglycol,1-[(2-cyanoethyl)-(N,N-diisopropyl)]-
phosphoramidite), and Spacer Phosphoramidite 18 (i.e., 18-O-
Dimethoxytrityl-hexaethyleneglycol,1-[(2-cyanoethyl)-(N,N-
diisopropyl)]-phosphoramidite), both available from Glen
Research (Sterling, Virginia).

Other spacers are available for use in standard 
oligonucleotide synthesis. For example, Spacer
Phosphoramidite C3 and dSpacer Phosphoramidite can be used to
destabilize undesirable self-hybridization events within
capture oligonucleotides or to destabilize false
hybridization events between incorrectly-matched
template/probe complexes. Such spacers, when positioned at
the 3' end of an oligonucleotide, will also prevent incorrect
extension products from being generated when included in a
PCR reaction mixture.

One spacer available from Glen Research, Spacer
Phosphoramidite C3 (i.e., 3-O-Dimethoxytrityl-propyl-1-[(2-
cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite), can be added
to substitute for an unknown base within an oligonucleotide 
sequence. 

A branching spacer may be used as one method to increase 
label incorporation into an oligonucleotide. Such a 
branching spacer may also be used to increase a detectable 
signal by hybridization through multiply branched capture 
probes or PCR primers. Branching spacers are available
commercially, e.g., from Glen Research. 

Biotinylated oligonucleotides are well known in the art.
An oligonucleotide may be biotinylated using a biotin-NHS
ester procedure. Alternatively, biotin may be attached
during oligonucleotide synthesis using a biotin
phosphoramidite (Cocuzza, 1989, Tetrahed. Lett. 30, 6287-
6290). One such biotin phosphoramidite available from Glen
Research is 1-Dimethoxytrityloxy-2-(N-biotinyl-4-aminobutyl)-
propyl-3-O-(2-cyanoethyl)-(N,N-diisopropyl)-phosphoramidite.
This compound also has a branch point to allow further
additions. The branched spacer used in this biotin
phosphoramidite has been described by Nelson et al. (Nelson
et al., 1992, Nucl. Acids Res. 20, 6253-6259).

Another 5'-biotin phosphoramidite, namely [1-N-(4,4'-
Dimethoxytrityl)-biotinyl-6-aminohexyl]-2-cyanoethyl-(N,N-
diisopropyl)-phosphoramidite, may be used to biotinylate an
oligonucleotide. This compound is sold by Glen Research 
under license from Zeneca PLC. 

Fluorescent dyes may also be incorporated into an 
oligonucleotide using dye-labeled phosphoramidites. Two such
labels are 5'-Hexachloro-Fluorescein Phosphoramidite (HEX),
and 5'-Tetrachloro-Fluorescein Phosphoramidite (TET), both
available from Glen Research. 



5.4. PHENOTYPE SELECTION TO OPTIMIZE THE VGID™
METHOD

Best results are obtained with the VGID™ method when
phenotype selection is first given careful consideration by
the practitioner (see FIG. 3). What follows is a preferred,
but not limiting, phenotype selection method.

The phenotype selection process generally begins with a 
literature review. This involves reviewing biological 
literature, medical literature, chemical literature, 
published bioassays and clinical data in connection with a 
phenotype-of-interest and with any phenotypes to be compared 
with (i.e. subtracted from) the phenotype-of-interest. 

In this regard, reference to the most current edition of
a catalog of known genetic disease may be made to initiate
the literature review. For example, one catalog of human
phenotypes is McKusick, Victor A., Mendelian Inheritance in
Man: Catalog of Autosomal Dominant, Autosomal Recessive, and
X-Linked Phenotypes (10th Edition, 1992, The Johns Hopkins
University Press, Baltimore, Maryland) (hereinafter "MIM™").
MIM™ is also available in a continuously-updated, online
version (hereinafter "OMIM™"), which may be accessed at no
charge by contacting OMIM™ User Support, Welch Memorial
Library, 1830 East Monument Street, Third Floor, Baltimore,
Maryland 21205, or via e-mail to omimhelp@welch.jhu.edu. In
general, MIM™ and OMIM™ comprise a catalog with one entry
per human gene locus, whether or not the gene has been
associated with any particular disease. Each entry, usually
one or two paragraphs, provides information having the
following components (when the information is available): (a)
title, including any synonyms in parentheses; (b) a
description of the phenotype or gene product; (c) the nature
of the basic defect in any associated disorder; (d) a
description of diagnosis and management of the disorder,
where applicable; (e) genetics, including mapping
information; (f) allelic variants; and (g) references.
Finding aids in MIM™ include an author index and a title
index. The content of OMIM™, in addition to being the most
current data available in these catalogs, is fully computer
searchable. There were nearly 6,000 entries in the Tenth
Edition of MIM™ (1992). Therefore, if one makes the usual
assumption that perhaps 100,000 human genes exist, this
catalog is only 6% complete. Accordingly, the vast majority
of genes identified using the VGID™ method will not be
represented. Nevertheless, existing entries may comprise
related phenotypes and/or references which provide insight
into the genetic nature of the phenotype-of-interest.
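
For a practitioner who keeps notes from such a catalog in electronic form, the entry components (a) through (g) listed above map naturally onto a small record type. The sketch below is only an illustration: the class and field names are hypothetical and do not correspond to any actual MIM or OMIM file format. The helper at the end reproduces the completeness estimate given in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogEntry:
    """One gene-locus entry; fields mirror components (a)-(g) in the text."""

    title: str                                                  # (a) title
    synonyms: List[str] = field(default_factory=list)           # (a) synonyms
    description: Optional[str] = None                           # (b) phenotype or gene product
    basic_defect: Optional[str] = None                          # (c) basic defect
    diagnosis_management: Optional[str] = None                  # (d) diagnosis and management
    genetics: Optional[str] = None                              # (e) genetics, incl. mapping
    allelic_variants: List[str] = field(default_factory=list)   # (f) allelic variants
    references: List[str] = field(default_factory=list)         # (g) references


def catalog_completeness(n_entries, assumed_gene_count):
    """Fraction of the assumed number of genes that the catalog covers."""
    if assumed_gene_count <= 0:
        raise ValueError("assumed_gene_count must be positive")
    return n_entries / assumed_gene_count


if __name__ == "__main__":
    # Figures from the text: nearly 6,000 entries, perhaps 100,000 genes.
    print(f"{catalog_completeness(6000, 100000):.0%}")
```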

Other useful information sources include computer-
indexed journal collections such as Medline®. For disease
phenotypes, internal medicine handbooks may also be consulted
(see Isselbacher et al., eds., Harrison's Principles of
Internal Medicine, 13th Edition, McGraw-Hill, Inc., New
York). The practitioner skilled in the relevant art
generally knows which literature sources to review.

Of course, it is not required that a phenotype be
recognized in the literature as having a genetic component in
order for the VGID™ method to identify genes associated with
the phenotype. Indeed, it may even be the absence of such a
published recognition or understanding which leads the
practitioner to ask what genes are identified using the VGID™
method. In this regard, the practitioner's personal
knowledge or belief is an important factor to be considered
in phenotype selection. A given phenotype-of-interest can be
quite complex and will often be polygenic (see discussion in
Section 2.1 hereinabove). In one embodiment, the VGID™
method may involve one or a combination of the two approaches
set forth previously herein. Recall that one approach,
schematically set forth in FIG. 1, involves performing the
VGID™ method using phenotypic groups defined by sources not
known to share a common ancestor (e.g. most cell line
samples). The other approach, schematically set forth in
FIG. 2, involves performing the VGID™ method using phenotypic
groups defined by samples obtained from sources known to
share at least one common ancestor (i.e. consanguineous
sources).

An overview of the phenotype selection process is set 
forth in FIG. 3. In this figure, PRACTITIONER represents an 
individual skilled in the relevant biological art (e.g.
geneticist, microbiologist, virologist, endocrinologist,
plant molecular biologist, pathologist, physiologist,
surgeon, postdoctoral fellow, graduate student, research
technician); LITERATURE SEARCH represents a review of the
relevant literature performed by the practitioner; PERSONAL
KNOWLEDGE represents the knowledge, understanding and belief
in the relevant biological art provided by the practitioner;
and PHENOTYPE SELECTION represents the identification of the
appropriate biological samples by the practitioner after
having considered the literature search and the
practitioner's personal knowledge.

5.4.1. TISSUE SAMPLE COLLECTION

Tissue samples are typically collected using methods
well known to those of skill in the relevant art. For
example, to identify genes involved in colon cancer, the
gastroenterologist or endoscopist may collect healthy and
diseased biopsy samples using an endoscope. Common sense is
the guiding principle here. The VGID™ method will provide
best results where normal and diseased samples are
systematically and thoroughly defined using objective
criteria.



5.4.2. CELL CULTURE 

When using the VGID™ method with cell lines as input
phenotypes, utmost care is advised. The specific conditions
used for culturing will profoundly influence gene expression
in virtually all cell lines. For example, the steroid
hormone aldosterone influences the expression of genes
important for salt absorption by epithelia, such as the A6
cell line derived from Xenopus laevis. The concentration of
hormones and growth factors may vary over a broad range in
media supplements commonly used (e.g. fetal calf serum or
newborn calf serum). Therefore, careful attention should be
paid to the control of such variables. If problems arise,
consideration should be given to the reservation of specific
lots for all ingredients used. Further, chemical analyses of
specific components may also be required as part of the
standardization process. This concern over control of growth
conditions is not limited to hormones and growth factors.
Gene expression may be influenced by such basic parameters as
length of time between passage, incubation temperature, pH,
and the like.

Accordingly, gene identification with the VGID™ method
using cell lines as input sources will be optimized and
enhanced by careful attention to defining, and maintaining
constant, the cell culture conditions associated with the
phenotype-of-interest. This is equally true for the
phenotype or phenotypes to be compared (i.e. subtracted).
The culture of animal cells has been well described by
numerous references in the literature. The literature search
conducted in choosing input phenotypes should be focused, in
part, on defining the optimum cell culture conditions. For a
broad overview of cell culture techniques and other relevant
considerations, see Freshney, R.I., 1994, Culture of Animal
Cells, A Manual of Basic Technique, 3d Edition, John Wiley &
Sons, Inc., New York, New York.
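
One way to keep the culture conditions of the compared phenotypes demonstrably constant is to record them in a fixed structure and list any parameters that differ between two cultures. The following is a minimal sketch under stated assumptions: the record type, its field names and the example values are hypothetical, and only the kinds of parameters (serum lot, time between passage, incubation temperature, pH) are taken from the text.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CultureConditions:
    """Hypothetical record of the culture parameters named in the text."""

    serum_lot: str
    passage_interval_days: float
    incubation_temp_c: float
    ph: float


def differing_parameters(a, b):
    """Return the sorted names of parameters that differ between two cultures."""
    values_a, values_b = asdict(a), asdict(b)
    return sorted(name for name in values_a if values_a[name] != values_b[name])


if __name__ == "__main__":
    # Example values are invented for illustration.
    interest = CultureConditions("lot-A", 3.0, 37.0, 7.4)
    subtracted = CultureConditions("lot-B", 3.0, 37.0, 7.2)
    print(differing_parameters(interest, subtracted))
```

An empty list indicates that the recorded parameters match; any name returned points to a variable that should be brought under control before the comparison is run.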



5.5. TROUBLESHOOTING THE VGID™ METHOD

If a given phenotype-of-interest is initially resistant
to the above-described approaches for using the VGID™ method
to identify a gene-of-interest, the following troubleshooting
discussion may be helpful. A resistant phenotype-of-interest
may be indicated by the identification of no genes, or the
identification of too many genes (e.g. over 100), in the
appropriate pool under the experimental design chosen.
Consider a case where an initial screen does not identify a 
genetic component associated with a given phenotype-of- 
interest. In this instance, careful attention should be paid 
to redefining the nucleic acid populations defined by the 
input phenotypes. For example, a synergistic effect between 
one or more genes and an environmental factor may be required 
for manifestation of the phenotype-of-interest. In this 
instance, it is desirable to identify and control any 
environmental factor present. In this way, a weak genetic 
determinant for a given phenotype-of-interest may be 
strengthened by careful modification of the criteria for 
inclusion in a phenotypic group. 
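
The criterion for a resistant phenotype-of-interest given above (no genes identified, or too many genes, e.g. over 100) can be written as a simple check on the size of the resulting pool. This is only a sketch: the function name and return strings are hypothetical, and the threshold of 100 is the example figure from the text rather than a fixed limit.

```python
def classify_screen(n_genes_identified, max_genes=100):
    """Classify a screen by the number of genes found in the relevant pool.

    max_genes defaults to the example threshold mentioned in the text.
    """
    if n_genes_identified < 0:
        raise ValueError("number of genes cannot be negative")
    if n_genes_identified == 0:
        return "resistant: no genes identified"
    if n_genes_identified > max_genes:
        return "resistant: too many genes identified"
    return "not resistant"


if __name__ == "__main__":
    for count in (0, 12, 250):
        print(count, "->", classify_screen(count))
```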

A variety of biological assays may also be used to 
further define a phenotypic group. Examples of such assays 
are set forth in the Section immediately below. 

5.6. ASSAYS FOR PHENOTYPE SELECTION 

Enzymatic and receptor-based biological assays may be
used to further define a phenotype which is initially
resistant to gene identification with the VGID™ method. Such
definition is directed toward exclusion of individuals from a
population which may not contribute to the genotype and
which, therefore, would be beneficial to exclude from the
gene identification assay. The eventual therapeutic use(s)
resulting from the gene identification may serve as a guide
to selection of relevant biological assays known in the art.
For example, the bioassays selected for further definition of
the phenotype of schizophrenia might involve a panel of
central nervous system receptors implicated in that disease.
There are many sources available which describe enzymatic or
receptor assays. One example is the Methods In Enzymology
series published by Academic Press. One skilled in the art
would know what assays are most appropriate for defining the
input phenotype.

For example, for using the VGID™ method on a
neurological disorder with a genetic component, relevant
bioassays might include assays for activity of adrenergic
receptors, cholinergic receptors, dopamine receptors, GABA
receptors, glutamate receptors, monoamine oxidase, nitric
oxide synthetase, opiate receptors, or serotonin receptors.
For cardiovascular disorders, appropriate assays may include
adenosine A1 receptors, adrenergic receptors (including α1
and α2), angiotensin I inhibition, platelet aggregation, ion
channel blockade (e.g. calcium channels, chloride channels),
cardiac arrhythmia measurement, blood pressure, heart rate,
contractility or hypoxia. For a metabolic disorder the
following bioassays may be used: serum cholesterol, serum
HDL, serum HDL/cholesterol ratio, HDL/LDL ratios, serum
glucose, kaliuresis, saluresis, or urine volume change. For
an allergic or inflammation disorder the following bioassays
may be used: Arthus reaction, passive cutaneous
anaphylaxis, bradykinin B2, tracheal contractility, histamine
H1 antagonism, carrageenan effects on macrophage migration,
leukotriene D4 antagonism, neurokinin NK1 antagonism, or
cytokine assays (e.g. the interleukins or macrophage
inhibitory proteins). For gastrointestinal disorders the
following bioassays may be used: cholecystokinin CCK
antagonism, cholinergic antagonism, gastric acidity, or
serotonin 5-HT3 antagonism. The above listings merely provide
exemplary assays. One skilled in the art would be able to
choose a relevant bioassay or collection of bioassays for use
in defining a phenotype.

5.7. DISEASES, DISORDERS, AND OTHER PHENOTYPES 

The various phenotypes for which genes may be identified 
using the VGID™ method include, but are not limited to, any 
of the following disorders, diseases and phenotypes. 
Examples of disease states include the following: acquired 
immunodeficiency syndrome (AIDS), angina, arteriosclerosis,
arthritis, asthma, high or low blood pressure, bronchitis,
cancer, cholesterol imbalance, cerebral circulatory
disturbance, clotting disorder, cirrhosis, depression,
dermatologic disease, diabetes, diarrhea, diuresis,
dysmenorrhea, dyspepsia, emphysema, gastrointestinal distress,
hemorrhoids, hepatitis, hypertension, hyperprolactinemia,
immunomodulation, resistance to bacterial infection,
resistance to viral infection, inflammation, insomnia,
lactation, lipidemia, migraine, pain prevention or
management, peripheral vascular disease, platelet
aggregation, premenstrual syndrome, prostatic disorder,
elevated triglycerides, respiratory tract infection,
retinopathy, sinusitis, rheumatic disease, impaired wound
healing, tinnitus, urinary tract infection and venous
insufficiency.

Other phenotypes include, but are not limited to, 
cardiovascular disorders, nervous system disorders, enhancing
memory, hypercholesterolemia, immune system stimulation,
anti-inflammatory, antipyretic, analgesic, slowing the aging
process, accelerated convalescence, anemia, indigestion,
impotence and menstrual disorders.

Preferred phenotypes include, but are not limited to, 
plant resistance phenotypes (e.g. resistance to herbicides or 
insect predators), microorganism resistance phenotypes (e.g.
resistance to antibiotics), cancer (e.g. breast, prostate),
osteoporosis, obesity, type II diabetes, and prion-related
diseases (e.g. bovine spongiform encephalopathy, Creutzfeldt-
Jakob disease).

5.8. LINKING OLIGONUCLEOTIDES TO SPECIFIC BINDING
LIGANDS

The present invention embodies a method for the use of 
identification tags (i.e. specific binding ligands which are 
recognized by specific receptors) to facilitate the 
identification and isolation of sequences of interest from a 
complex mixture. In order to sort unknown sequences specific 
to a cell or a given phenotype from a mixture of sequences of 
different origins, it is important to be able to identify 
sequences from various sources. One method to label each 
different source of cDNA with a unique identification tag is 
accomplished by using labeled PCR oligos.

The PCR oligonucleotide primers utilized to generate the
raw material used in the VGID™ assays are labeled with a
specific binding ligand. The labeling involves attaching the
ligand to the oligonucleotide in a stable manner. In a
preferred embodiment, the ligand is attached to the primer
via covalent bonding. Methods for attaching the primer to
the ligand are well known in the art. Various
oligonucleotide labels include biotin, avidin or streptavidin
and their derivatives, lectin, carbohydrate, peptide, hapten,
or immunological material.
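
The bookkeeping behind the identification tags described above can be sketched as follows: each source of cDNA is paired with its own tag, and molecules recovered from a mixture are then grouped by the source that their tag identifies. This is a conceptual illustration only, not the laboratory procedure; the function names, source names, tag names and sequences are all hypothetical.

```python
from collections import defaultdict


def assign_tags(sources, tags):
    """Pair each nucleic acid source with its own identification tag."""
    if len(set(tags)) != len(tags):
        raise ValueError("identification tags must be unique")
    if len(tags) < len(sources):
        raise ValueError("need at least one tag per source")
    return dict(zip(sources, tags))


def sort_by_tag(molecules, source_to_tag):
    """Group (tag, sequence) pairs by the source that each tag identifies."""
    tag_to_source = {tag: source for source, tag in source_to_tag.items()}
    pools = defaultdict(list)
    for tag, sequence in molecules:
        pools[tag_to_source[tag]].append(sequence)
    return dict(pools)


if __name__ == "__main__":
    # Hypothetical sources, tags and sequences, for illustration only.
    tagging = assign_tags(["healthy cDNA", "diseased cDNA"], ["biotin", "peptide-1"])
    mixture = [("biotin", "ACGT"), ("peptide-1", "TTGA"), ("biotin", "GGCC")]
    print(sort_by_tag(mixture, tagging))
```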

Oligonucleotides may be labeled with a wide variety of 
labels for use in the various embodiments of the invention. 
For example, European Patent Publication No. EP 0370 694 A2,
entitled, "Diagnostic Kit and Method Using a Solid Phase
Capture Means For Detecting Nucleic Acid", by Burdick and 
Oakes, publication date May 30, 1990, discloses methods of 
linking labels to oligonucleotides. 

In a preferred embodiment, the oligonucleotides are
labeled with peptides. Methods of attaching peptides to
oligonucleotides are well known to those with ordinary skill
in the art, e.g., see, 1) Preparation and characterization of
antisense oligonucleotide-peptide hybrids containing viral
fusion peptides. Soukchareun et al., 1995, Bioconjug. Chem.
6(1), 43-53; 2) Preparation of oligonucleotide-peptide
conjugates. Tung et al., 1991, Bioconjug. Chem. 2(6), 464-
465; 3) Template-directed ligation of peptides to
oligonucleotides. Bruick et al., 1996, Chem. Biol. 3(1), 49-
56; 4) Dual-specificity interaction of HIV-1 TAR RNA with Tat
peptide-oligonucleotide conjugates. Tung et al., 1995,
Bioconjug. Chem. 6(3), 292-295; 5) Synthesis and Enzymatic
Stability of Phosphodiester-Linked Peptide-Oligonucleotide
Hybrids. Robles et al., 1997, Bioconjug. Chem. 8(6), 785-788;
and 6) Covalent Protein-Oligonucleotide Conjugates for
Efficient Delivery of Antisense Molecules. Rajur et al.,
1997, Bioconjug. Chem. 8(6), 935-940.

Oligonucleotides linked to various peptides for use in 
the methods of this invention may be obtained, for example,
from Cybergene S.A. (11 rue Claude Bernard, ZI Nord, 35400,
Saint Malo, France) and Glen Research (22825 Davis Drive,
Sterling, Virginia 20164). Further information from Glen
Research can be obtained through their web site
(www.glenres.com).

One specific method for linking a peptide to an 
oligonucleotide recommended by Glen Research is as follows 
(see also, www.glenres.com). A heterobifunctional
crosslinking reagent is used to link a synthetic peptide
having an N-terminal lysine residue to a 5'-thiol-modified
oligonucleotide. Such a crosslinking reagent is N-maleimido-
6-aminocaproyl-(2'-nitro,4'-sulfonic acid) phenyl ester
(mal-sac-HNSA). The sodium salt of mal-sac-HNSA is available
from Bachem Bioscience. Conveniently, reaction of the mal-
sac-HNSA crosslinker with an amino group releases a dianion
phenolate (i.e. 1-hydroxy-2-nitro-4-benzene sulfonic acid).
This dianion phenolate is also a yellow chromophore. The
chromophore feature provides (i) a means for quantifying the
extent of completion of the coupling reaction (where greater
yellow color intensity corresponds to a more complete
coupling reaction), and (ii) an aid in monitoring the extent
of separation of an activated peptide (i.e. a peptide
crosslinked to mal-sac-HNSA and ready for contacting with a
5'-thiol-modified oligonucleotide) from free crosslinking
reagent during gel filtration.

The specific steps employed when using a mal-sac-HNSA
crosslinker may be as follows. First, a peptide is
synthesized having an N-terminal lysine. Alternatively, a
peptide having an internal lysine may be used since the
lysine epsilon amino group is actually more reactive than the
lysine alpha amino group. Second, an oligonucleotide is
synthesized having a 5'-thiol group using methods known in
the art. Third, the peptide is reacted with an excess of
mal-sac-HNSA in a sodium phosphate buffer (pH 7.1). Fourth,
the peptide-mal-sac conjugate is separated from free
crosslinker and the buffer is exchanged to sodium phosphate
(pH 6) using a gel filtration column (e.g. NAP-5, Pharmacia,
Uppsala, Sweden). Fifth, a thiol-modified oligonucleotide is
activated, desalted and buffer-exchanged to sodium phosphate
(pH 6) on a gel filtration column. Sixth, the activated
peptide is reacted with the thiol-modified oligonucleotide.
Finally, the peptide-oligonucleotide conjugate is purified by
ion exchange chromatography (e.g. Nucleogen DEAE-500-10 or
equivalent). The elution order from the ion exchange column
is as follows: free peptide first, peptide-labeled
oligonucleotide next, and free oligonucleotide last.
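
The coupling steps and the elution order described above can be kept as a simple ordered checklist, for instance in a laboratory notebook script. The sketch below only restates the sequence from the text; the type, constant and function names are hypothetical, and a buffer pH is recorded only for the steps where the text gives one.

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    number: int
    action: str
    buffer_ph: Optional[float]  # None where the text states no pH


COUPLING_STEPS = [
    Step(1, "synthesize peptide with an N-terminal (or internal) lysine", None),
    Step(2, "synthesize oligonucleotide with a 5'-thiol group", None),
    Step(3, "react peptide with excess mal-sac-HNSA", 7.1),
    Step(4, "gel filtration: remove free crosslinker, exchange buffer", 6.0),
    Step(5, "activate, desalt and buffer-exchange the thiol oligonucleotide", 6.0),
    Step(6, "react activated peptide with thiol-modified oligonucleotide", None),
    Step(7, "purify conjugate by ion exchange chromatography", None),
]

# Order in which species leave the ion exchange column, per the text.
ELUTION_ORDER = [
    "free peptide",
    "peptide-labeled oligonucleotide",
    "free oligonucleotide",
]


def elutes_before(first, second):
    """True if `first` is expected to elute before `second`."""
    return ELUTION_ORDER.index(first) < ELUTION_ORDER.index(second)


if __name__ == "__main__":
    for step in COUPLING_STEPS:
        ph = f" (pH {step.buffer_ph})" if step.buffer_ph is not None else ""
        print(f"{step.number}. {step.action}{ph}")
```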

In a preferred embodiment, the peptide-labeled
oligonucleotides are recognized by specific receptors, such
as antibodies, to sort and isolate particular nucleic acids
from a complex mixture. Given that a peptide-labeled
oligonucleotide primer may be subjected to high temperatures
during PCR, in a preferred embodiment, the peptide is
sufficiently short (i.e. no more than five amino acids) to
resist irreversible denaturation. In addition, the peptides
can be made resistant to different classes of proteases by
avoiding the inclusion of protease-sensitive peptide links
such as those that involve serine and/or threonine. A
number of procedures have been used to add a single primary
aliphatic amino group to oligonucleotides. See e.g. Agrawal
et al., 1986, Nucleic Acids Res. 14, 6227-6245; Chollet et
al., 1985, Nucleic Acids Res. 13, 1529-1541; Wachter et al.,
1986, Nucleic Acids Res. 14, 7985-7994; Sproat et al., 1987,
Nucleic Acids Res. 15, 6181-6196; Li et al., 1987, Nucleic
Acids Res. 15, 5275-5286; and Smith et al., 1985, Nucleic
Acids Res. 13, 2439-2502. Various methods may be used to
synthesize oligonucleotides containing multiple amino groups
attached to the oligonucleotide through a linker arm.
Haralambidis et al., 1990, Nucleic Acids Res. 18, 493-499;
Haralambidis et al., 1987, Nucleic Acids Res. 15, 4857-4876;
Ruth et al., 1985, DNA 4, 93; Ruth, 1984, DNA 3, 123; Draper,
1984, Nucleic Acids Res. 12, 989-1002. Attachment of a
ligand such as biotin to the oligonucleotide is described in
Kempe et al., 1985, Nucleic Acids Res. 13, 45-57. The use of
an alkylating intercalation moiety as an attachment is
described in U.S. Patent No. 4,582,789. Other standard
peptide coupling methods and derivatized oligonucleotide
methods can also be used (see e.g. EPA 0 370 694).
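
The two peptide design preferences stated above (no more than five amino acids; no serine or threonine) can be screened with a few lines of code before a peptide label is ordered. This is a sketch under stated assumptions: sequences are given in one-letter amino acid code, the function and constant names are hypothetical, and the example peptides are invented.

```python
MAX_TAG_LENGTH = 5                # "no more than five amino acids"
AVOIDED_RESIDUES = {"S", "T"}     # serine and threonine
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def check_peptide_tag(peptide):
    """Return a list of problems; an empty list means both preferences are met."""
    sequence = peptide.upper()
    problems = []
    if not sequence or set(sequence) - STANDARD_RESIDUES:
        problems.append("not a one-letter amino acid sequence")
    if len(sequence) > MAX_TAG_LENGTH:
        problems.append(f"longer than {MAX_TAG_LENGTH} residues")
    found = sorted(AVOIDED_RESIDUES & set(sequence))
    if found:
        problems.append("contains " + "/".join(found))
    return problems


if __name__ == "__main__":
    # Invented example peptides, for illustration only.
    for candidate in ("KGAAG", "KSTAGAA"):
        print(candidate, check_peptide_tag(candidate) or "ok")
```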

These standard procedures from the above references are 
incorporated herein by reference in their entireties. 



5.9. ANTIBODIES AND DERIVATIVES THEREOF 

In a preferred embodiment, antibodies may be used to
specifically recognize one or more peptide-labeled 
oligonucleotides used to label a plurality of nucleic acid 
populations (e.g. cDNA libraries) . Such antibodies include 
but are not limited to polyclonal, monoclonal, humanized or 
chimeric antibodies, single chain antibodies, Fab fragments 
and F(ab')2 fragments, fragments produced by a Fab expression 
library, anti-idiotypic (anti-Id) antibodies, and epitope- 
binding fragments of any of the above. Such antibodies may 
be used as ligands in one of the screening steps in the 
present invention. 

Polyclonal antibodies which may be used in the methods
of the invention are heterogeneous populations of antibody
molecules derived from the sera of immunized animals.
Various procedures well known in the art may be used for the
production of polyclonal antibodies to an antigen-of-
interest. For example, for the production of polyclonal
antibodies, various host animals can be immunized by
injection with an antigen of interest or derivative thereof,
including but not limited to rabbits, mice, rats, etc.
Various adjuvants may be used to increase the immunological
response, depending on the host species, and including but
not limited to Freund's (complete and incomplete), mineral
gels such as aluminum hydroxide, surface active substances
such as lysolecithin, pluronic polyols, polyanions, peptides,
oil emulsions, keyhole limpet hemocyanins, dinitrophenol, and
potentially useful human adjuvants such as BCG (bacille
Calmette-Guerin) and Corynebacterium parvum. Such adjuvants
are also well known in the art.

Monoclonal antibodies which may be used in the methods 
of the invention are homogeneous populations of antibodies to
a particular antigen. A monoclonal antibody (mAb) to an
antigen-of-interest can be prepared by using any technique
known in the art which provides for the production of
antibody molecules by continuous cell lines in culture.
These include but are not limited to the hybridoma technique
originally described by Kohler and Milstein, 1975, Nature
256, 495-497, and the more recent human B cell hybridoma
technique (Kozbor et al., 1983, Immunology Today 4, 72), and
the EBV-hybridoma technique (Cole et al., 1985, Monoclonal
Antibodies and Cancer Therapy, Alan R. Liss, Inc., pp. 77-
96). Such antibodies may be of any immunoglobulin class
including IgG, IgM, IgE, IgA, IgD and any subclass thereof.
The hybridoma producing the mAbs of use in this invention may
be cultivated in vitro or in vivo.

The monoclonal antibodies which may be used in the 
methods of the invention include but are not limited to human 
monoclonal antibodies or chimeric human-mouse (or other 
species) monoclonal antibodies. Human monoclonal antibodies 
may be made by any of numerous techniques known in the art 
(e.g., Teng et al., 1983, Proc. Natl. Acad. Sci. U.S.A. 80,
7308-7312; Kozbor et al., 1983, Immunology Today 4, 72-79;
Olsson et al., 1982, Meth. Enzymol. 92, 3-16).

Further, humanized monoclonal antibodies may be used. 
Briefly, humanized antibodies are antibody molecules from
non-human species having one or more complementarity
determining regions (CDRs) from the non-human species and a
framework region from a human immunoglobulin molecule.
Various techniques have been developed for the production of
humanized antibodies (see e.g., Queen, U.S. Patent No.
5,585,089, which is incorporated herein by reference in its
entirety). An immunoglobulin light or heavy chain variable
region consists of a "framework" region interrupted by three
hypervariable regions, referred to as complementarity
determining regions (CDRs). The extent of the framework
region and CDRs have been precisely defined (see, Kabat et
al., 1983, "Sequences of Proteins of Immunological Interest",
U.S. Department of Health and Human Services).

A chimeric antibody is a molecule in which different 
portions are derived from different animal species, such as 
those having a variable region derived from a murine mAb and 
a human immunoglobulin constant region. Techniques have been
developed for the production of "chimeric antibodies"
(Morrison et al., 1984, Proc. Natl. Acad. Sci. U.S.A. 81,
6851-6855; Neuberger et al., 1984, Nature, 312, 604-608;
Takeda et al., 1985, Nature, 314, 452-454) by splicing the 
genes from a mouse antibody molecule of appropriate antigen 
specificity together with genes from a human antibody 
molecule of appropriate biological activity. 

Alternatively, techniques described for the production
of single chain antibodies (U.S. Patent No. 4,946,778; Bird,
1988, Science 242, 423-426; Huston et al., 1988, Proc. Natl.
Acad. Sci. USA 85, 5879-5883; and Ward et al., 1989, Nature
334, 544-546) can be adapted to produce single chain
antibodies against the peptide portion of the peptide-labeled
oligonucleotides useful in the methods of the
invention. Single chain antibodies are formed by linking the
heavy and light chain fragments of the Fv region via an amino
acid bridge, resulting in a single chain polypeptide.

Antibody fragments which recognize specific epitopes may 
be generated by known techniques. For example, such 
fragments include but are not limited to: the F(ab')2 
fragments which can be produced by pepsin digestion of the
antibody molecule and the Fab fragments which can be 
generated by reducing the disulfide bridges of the F(ab')2 
fragments. Alternatively, Fab expression libraries may be 
constructed (Huse et al., 1989, Science, 246, 1275-1281) to 
allow rapid and easy identification of monoclonal Fab
fragments with the desired specificity.

Antibodies to the peptide portion of a peptide-labeled 
oligonucleotide can, in turn, be utilized to generate anti- 
idiotype antibodies that "mimic" the peptide, using 
techniques well known to those skilled in the art. (See, 
e.g., Greenspan & Bona, 1993, FASEB J 7(5), 437-444; and
Nisonoff, 1991, J. Immunol. 147(8), 2429-2438). For
example, antibodies which bind to the peptide and 
competitively inhibit the binding of peptide to its receptor 
can be used to generate anti-idiotypes that "mimic" the 
peptide receptor and, therefore, bind the peptide. 

A molecular clone of an antibody to an antigen-of-interest
can be prepared by many well known techniques.
Recombinant DNA methodology (see e.g., Maniatis et al., 1982,
Molecular Cloning, A Laboratory Manual, Cold Spring Harbor 
Laboratory, Cold Spring Harbor, New York) may be used to 
construct nucleic acid sequences which encode a monoclonal 
antibody molecule, or antigen binding region thereof. 

Antibody molecules may be purified by many well known
techniques, e.g., immunoabsorption or immunoaffinity
chromatography, chromatographic methods such as HPLC (high
performance liquid chromatography), or a combination thereof.

The methods of antibody production and use employed
herein can, for example, be such as those described in Harlow
and Lane (Harlow, E. and Lane, D., 1988, Antibodies: A
Laboratory Manual, Cold Spring Harbor Laboratory Press, Cold
Spring Harbor, New York), which is incorporated herein by
reference in its entirety.

The single-letter amino acid codes which correspond to
the three-letter amino acid codes of the Sequence Listing are
set forth hereinbelow: A, Ala; R, Arg; N, Asn; D, Asp; B,
Asx; C, Cys; Q, Gln; E, Glu; Z, Glx; G, Gly; H, His; I, Ile;
L, Leu; K, Lys; M, Met; F, Phe; P, Pro; S, Ser; T, Thr; W,
Trp; Y, Tyr; and V, Val.
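
By way of illustration only, the correspondence set forth above
may be expressed as a lookup table. The following Python sketch
is not part of the disclosed method; the names ONE_TO_THREE and
to_three_letter are hypothetical and are introduced solely for
this example.

```python
# Single-letter to three-letter amino acid codes, as listed in the text.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "B": "Asx",
    "C": "Cys", "Q": "Gln", "E": "Glu", "Z": "Glx", "G": "Gly",
    "H": "His", "I": "Ile", "L": "Leu", "K": "Lys", "M": "Met",
    "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val",
}


def to_three_letter(peptide: str) -> str:
    """Convert a single-letter peptide string to hyphen-joined three-letter codes."""
    return "-".join(ONE_TO_THREE[residue] for residue in peptide.upper())


if __name__ == "__main__":
    # Example input: any peptide written in single-letter code.
    print(to_three_letter("CTGEEDTSE"))
```

A residue outside the table raises a KeyError, which may be
preferable to silently passing an unrecognized code.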

Suitable antibodies for use with the methods of this
invention include the following, available from Affinity
Bioreagents, Inc., 79, rue des Morillons, 75015, Paris,
France.

1) Catalog No. PA 1-047 (affinity-purified rabbit
IgG). The corresponding peptide recognized by
this antibody is KFSREKKAAKT (SEQ ID NO:71).

2) Catalog No. PA 1-039 (affinity-purified rabbit
immunoglobulins). The corresponding peptide
recognized by this antibody is DQKRYHEDIFG
(SEQ ID NO:72).

3) Catalog No. PA 1-036 (purified rabbit IgG).
The corresponding peptide recognized by the
antibody is DLKEEKDINNNVKKT (SEQ ID NO:73).

4) Catalog No. PA 1-014 (purified rabbit
antibody). The corresponding peptide
recognized by this antibody is CTGEEDTSE (SEQ
ID NO:74).

5) Catalog No. PA 3-013 (affinity purified IgG).
The corresponding peptide recognized by this
antibody is PEETQTQDQPM (SEQ ID NO:75).

6) Catalog No. PA 1-815 (rabbit anti-serum). The
corresponding peptide recognized by this
antibody is QKSDQGVEGPGAT (SEQ ID NO:76).

7) Catalog No. PA 3-034 (rabbit polyclonal serum
IgG). The corresponding peptide recognized by
this antibody is DIGQSIKKFSKV (SEQ ID NO:77).
This polyclonal antibody will also recognize
QRADSLSSHL (SEQ ID NO:78).

In addition, antibodies for use with the methods of this
invention may be obtained from Medical & Biological
Laboratories Co., Ltd., 440 Arsenal Street, Watertown,
Massachusetts 02171, U.S.A.

These include the following: 

1) Code No. 561 (Rabbit IgG from anti-serum).
The corresponding peptide recognized by this
antibody is YPYDVPDYA (SEQ ID NO:79).

2) Code No. 562 (Rabbit IgG from anti-serum).
The corresponding peptide recognized by this
antibody is EQKLISEEDL (SEQ ID NO:80).

3) Code No. 563 (Rabbit IgG from anti-serum).
The corresponding peptide recognized by this
antibody is YTDIEMNKLGK (SEQ ID NO:81).



5.10. ANTIBODY COLUMNS FOR SORTING NUCLEIC ACIDS

An antibody specific to a given peptide label which is 
used to identify nucleic acids arising from different sources 
may be bound to a solid phase support or carrier material to 
facilitate separation using various techniques well known in 
the art.

Here, the "solid phase support or carrier" can be any
support capable of binding an antigen or an antibody.
Well-known supports or carriers include glass, polystyrene,
polypropylene, polyethylene, dextran, nylon, amylases,
natural and modified celluloses, polyacrylamides, gabbros,
and magnetite. The nature of the carrier can be either
soluble or insoluble for the purposes of the present
invention. The solid phase support or carrier material can
have virtually any possible structural configuration so long 
as the coupled molecule is capable of binding to an antigen 
or antibody. Thus, the support configuration can be 
spherical, as in a bead, or cylindrical, as in the inside 
surface of a test tube, or the external surface of a rod. 
Alternatively, the surface can be flat such as a sheet, test 
strip, etc. Preferred supports include polystyrene beads. 
Those skilled in the art will know many other suitable 
carriers for binding antibody or antigen, or will be able to 
ascertain the same by use of routine experimentation. 

For example, the solid phase support or carrier material 
can be beads, polymeric particles, or other materials, so 
long as the coupled molecule is capable of binding to an 
antigen or antibody. Such solid phase supports are readily 
apparent to one of ordinary skill in the art. Particularly 
useful solid phase support or carrier materials are polymeric 
beads having an average particle size of from 0.1 to 10
micrometers.

Antibodies against specific peptides can be bound to any 
of the above described solid phase support or carrier 
material in different ways. Two of the most preferred 
attachments are adsorption and covalent attachment. 

The methods for making and using the antibody columns 
employed herein can, for example, be such as those described
in Harlow and Lane (Harlow and Lane, 1988, Antibodies: A
Laboratory Manual, Cold Spring Harbor Laboratory Press, Cold
Spring Harbor, New York), which is incorporated herein by
reference in its entirety. 

Further, examples of suitable antibodies and peptide
labels may be found, for example, in U.S. Application No.
09/174,328, entitled "METHODS FOR MANIPULATING COMPLEX
NUCLEIC ACID POPULATIONS USING PEPTIDE-LABELED
OLIGONUCLEOTIDES", by Iris and Pourny (Attorney Docket No.
9408-025), filed October 16, 1998, which is incorporated by
reference herein in its entirety.

5.11. DETECTION OF ANTIBODIES AGAINST PEPTIDE-LABELED
OLIGONUCLEOTIDES

Antibodies which recognize peptide-labeled
oligonucleotides may also be detectably labeled using any
method known to one skilled in the art. Many such methods
are known. For example, one of the ways in which an
anti-peptide antibody can be detectably labeled is by linking
the same to an enzyme for use in an enzyme immunoassay (EIA)
(Voller, 1978, "The Enzyme Linked Immunosorbent Assay (ELISA)",
Diagnostic Horizons 2, 1-7; Voller et al., 1978, J. Clin.
Pathol. 31, 507-520; Butler, 1981, Meth. Enzymol. 73, 482-523;
Maggio, 1980, Enzyme Immunoassay, CRC Press, Boca Raton,
FL; Ishikawa et al., 1981, Enzyme Immunoassay, Kgaku Shoin,
Tokyo). The enzyme which is bound to the antibody will react
with an appropriate substrate, preferably a chromogenic
substrate, in such a manner as to produce a chemical moiety
which can be detected, for example, by spectrophotometric,
fluorimetric or visual means. Enzymes which can be used
to detectably label the antibody include, but are not limited
to, malate dehydrogenase, staphylococcal nuclease,
delta-5-steroid isomerase, yeast alcohol dehydrogenase,
alpha-glycerophosphate dehydrogenase, triose phosphate isomerase,
horseradish peroxidase, alkaline phosphatase, asparaginase,
glucose oxidase, beta-galactosidase, ribonuclease, urease,
catalase, glucose-6-phosphate dehydrogenase, glucoamylase and
acetylcholinesterase. The detection can be accomplished by
colorimetric methods which employ a chromogenic substrate for
the enzyme. Detection may also be accomplished by visual
comparison of the extent of enzymatic reaction of a substrate
in comparison with similarly prepared standards.

Detection may also be accomplished using any of a
variety of other immunoassays. For example, by radioactively
labeling the antibodies or antibody fragments, it is possible
to detect the peptide portion of the peptide-labeled
oligonucleotide through the use of a radioimmunoassay (RIA)
(see, for example, Weintraub, 1986, Principles of
Radioimmunoassays, Seventh Training Course on Radioligand
Assay Techniques, The Endocrine Society, which is
incorporated by reference herein). The radioactive isotope
can be detected by such means as the use of a gamma counter
or a scintillation counter or by autoradiography.

It is also possible to label the antibody with a
fluorescent compound. When the fluorescently labeled
antibody is exposed to light of the proper wavelength, its
presence can then be detected due to fluorescence. Among the
most commonly used fluorescent labeling compounds are
fluorescein isothiocyanate, rhodamine, phycoerythrin,
phycocyanin, allophycocyanin, o-phthaldehyde and
fluorescamine.

The antibody can also be detectably labeled using 
fluorescence emitting metals such as 152Eu, or others of the
lanthanide series. These metals can be attached to the
antibody using such metal chelating groups as
diethylenetriaminepentaacetic acid (DTPA) or
ethylenediaminetetraacetic acid (EDTA).

The antibody also can be detectably labeled by coupling 
it to a chemiluminescent compound. The presence of the
chemiluminescent-tagged antibody is then determined by 
detecting the presence of luminescence that arises during the 
course of a chemical reaction. Examples of particularly 
useful chemiluminescent labeling compounds are luminol, 
isoluminol, theromatic acridinium ester, imidazole, 
acridinium salt and oxalate ester. 

Likewise, a bioluminescent compound may be used to label 
the antibody of the present invention. Bioluminescence is a 
type of chemiluminescence found in biological systems in
which a catalytic protein increases the efficiency of the
chemiluminescent reaction. The presence of a bioluminescent
protein is determined by detecting the presence of
luminescence. Important bioluminescent compounds for
purposes of labeling are green fluorescent protein,
luciferin, luciferase and aequorin.

6. EXAMPLE: USE OF THE VGID℠ METHOD TO IDENTIFY hDinP
GENES

6.1. INTRODUCTION 

Two human DinP (hDinP) genes have been identified using
the VGID℠ method applied to cell line sources, as described
hereinbelow. The VGID℠ approach employed was that described
hereinabove for cell line samples (see FIG. 1). The aim of
this example was to isolate any human homologue(s) of the
bacterial DinP gene. In bacteria and yeast, the product of
the DinP gene (i.e. DinP) is central to the inducible DNA
damage repair pathway known as the "SOS repair system."
Although this DNA repair pathway is known to exist in man,
inducibility has never been demonstrated. The components of
this pathway are known to be directly involved in the
appearance of secondary cancers following radiation therapy
or chemotherapy in humans. Nevertheless, the human genes
encoding the components of the pathway have not been
previously identified. The VGID℠ example described below
isolated, in less than three weeks, a total of five
independent human cDNA clones. The clones were analyzed by
DNA sequencing, translation into all six reading frames, and
protein database search (BLASTX). Translations of all five
clones displayed high amino acid sequence homology to the
bacterial DinP protein, thereby confirming the identification
of human homologues of bacterial DinP. It should be
noted that low stringency hybridization of a bacterial DinP
probe to a human library would not have identified these
clones since the nucleic acid sequence homology is too low to
permit this type of screen.
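
The translation of a clone into all six reading frames, which
precedes the protein database search mentioned above, may be
sketched as follows. This is a minimal Python illustration that
assumes the standard genetic code and an input containing only
the bases A, C, G and T; the function names are hypothetical,
and the database search itself is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip(("".join(codon) for codon in product(BASES, repeat=3)), AMINO_ACIDS)
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq: str) -> dict:
    """Translate three forward frames (+1..+3) and three reverse frames (-1..-3)."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCC").items():
        print(frame, protein)
```

Each of the six translated strings would then serve as a
separate query against a protein database.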
6.2. MATERIALS AND METHODS

Cell lines (phenotype selection). For isolation of
hDinP transcripts by MutS-mediated selective subtraction, two
input cell lines differing in their capacity to effect DNA
repair were utilized. The phenotype-of-interest (i.e. human
DinP activity) was provided by presumed "high expressor"
cells from a defined lymphoblastoid clonal line
(lymphoblasts) (i.e. cell line #1 in FIG. 1). These cells
were harvested at a time corresponding to the apex of their
in vitro growth curve (i.e. 84 hours after initiation of the
growth phase). The competitor-cDNA providers (i.e. cell line
#2 in FIG. 1) were hepatocytes grown in standard medium for
60 hours before harvesting. These two cell lines originated
from different sources and therefore have a very low
probability of consanguinity (i.e. of having a common
ancestor). However, the fast growth rate of these cell lines
is associated with the possibility of substantial levels of
mutation acquisition.

mRNA extraction and cDNA synthesis. These procedures
were performed using Gibco-BRL "Trizol" kits (mRNA
preparation) and Promega "Universal Riboclone" kits (cDNA
synthesis) according to manufacturers' protocols.
Synthesized cDNA was size-fractionated by electrophoresis in
an agarose gel (0.8%); fragments ranging from 300 to 600 base
pairs were excised from the gel. The cDNA was extracted from
the gel slices using Promega Gel-Clean kits according to the
manufacturer's protocol.

cDNA libraries. A library for each cell line was
constructed by blunt-end ligation of the size-selected cDNA
into the QuanTox™ Blunt (Quantum Biotechnologies) plasmid
vector. Ligation products were transformed into DH5α
competent E. coli cells. These cells were grown overnight in
ampicillin-containing liquid medium. Cells were next
harvested and the insert-containing plasmid vectors were
recovered using Qiagen plasmid purification kits.

Amplification of cDNA inserts. The inserts present in
the cDNA library obtained from the lymphoblasts were PCR
amplified using oligonucleotide primers specific to the
vector's cloning-cassette. The polymerase enzymes used were
Pfu DNA polymerase (1.5 U/100 µl reaction) and the Stoffel
fragment of DNA polymerase I (0.5 U/100 µl reaction). The

cycling protocol used was: 97°C, 3 min; 58°C, 5 sec; 70°C,
1 min; then 93°C, 30 sec; 58°C, 5 sec; 70°C, 1 min for
15 cycles. PCR products were purified over Qiagen columns.
The purified PCR products were heat-denatured at 98°C for
5 min, incubated at 65°C for 20 min and cooled to room
temperature.

The renatured and cooled PCR products (350 ng in 90 µl)
were equilibrated in an equal volume of "2X reaction buffer"
(40 mM Tris-HCl, pH 7.6; 0.02 mM EDTA; 10 mM MgCl2; 0.2 mM
DTT) and exposed for 35 min to MutS adsorbed onto glass beads
packed into a 0.5 ml Eppendorf tube perforated by a small
hole at the bottom of the tube. During the incubation phase,
the Eppendorf tubes were placed in a horizontal position to
increase the contact surface area between the beads and
reannealed cDNA. The unbound reannealed PCR products left
free in solution were recovered in the supernatant following
centrifugation to pellet the beads (8000 x g; 30 sec). This
MutS-mediated trapping step was repeated twice more with
fresh beads; the supernatant recovered at the end of this
operation was stored at 4°C until used. This supernatant
contains only the transcripts structurally identical among
all lymphoblasts in the phenotype-of-interest cell line.

The procedure followed for isolation of transcripts 
structurally common to all hepatocytes was identical to that 
described above, except that the primers used for the PCR 
amplification (corresponding to vector sequences encoding T3 
and T7 promoters) were biotinylated at the 5' end. The final
supernatant was also stored at 4°C until used.
Isolation of cDNA encoding hDinP. An aliquot of the
stored supernatant from hepatocytes was then mixed with an
aliquot of the stored supernatant from lymphoblasts in a 3:1
ratio (hepatocyte:lymphoblast). This mixture was denatured,
reannealed, and exposed to MutS-coated beads, as above, to
remove all mismatched heteroduplexes.

The supernatant from this mixture was next exposed to
streptavidin-coated beads (Dynabeads M-180, Dynal, used
according to the manufacturer's protocol) in order to trap
all non-mismatched homoduplex hybrids formed from one
hepatocyte strand and one lymphoblast strand (i.e.
transcripts structurally identical in hepatocytes and
lymphoblasts), as well as all remaining hepatocyte-specific
transcripts. This trapping step was performed by incubating
the supernatant recovered after the MutS binding reaction
(150 µl) with 150 µl of dry streptavidin beads.

Following recovery of the streptavidin bead supernatant, 
the beads were rinsed twice in 1X reaction buffer to recover
all unbound material. The washings were recovered by
centrifugation, pooled with the streptavidin bead
supernatant, and saved at 4°C. The pooled streptavidin bead
supernatant, theoretically containing only
lymphoblast-specific transcripts structurally identical in all
lymphoblast cells, was then desalted and concentrated using
Qiaex II DNA purification kits (Qiagen) as per the
manufacturer's protocol.

The purified material was blunt-ended by 3' extension
(DNA tailing kit, Boehringer-Mannheim), purified over Qiagen
columns as above, and cloned into the QuanTox™ Blunt vector
as previously described. The twelve recombinant colonies
obtained were then individually tested for the presence of
inserts by: (i) PCR amplification; and (ii) hybridization at
various stringencies with a PCR-generated, labeled fragment
of the E. coli DinP gene. Under low stringency hybridization
conditions (i.e. 40°C overnight in 3X SSC, 1X Denhardt's
solution, 20 mM sodium phosphate (pH 6.5), 10% dextran
sulfate, 100 µg/ml salmon sperm DNA; 3 washes in 3X SSC and
1% SDS at 40°C for 15 min each), signals from all twelve
clones were obtained, but the signals were slightly stronger
for the five clones later identified by sequencing and
computer analysis to be derived from two hDinP genes (see
below). By contrast, under medium stringency hybridization
conditions (i.e. 50°C overnight in the same buffer used for low
stringency plus 25% deionized formamide; 3 washes in 2X SSC
and 1% SDS at 50°C for 15 min each), weak signals from all
twelve clones were obtained without apparent differences in
signal intensities. Finally, under high stringency
hybridization conditions (i.e. 60°C overnight in the same
buffer used for medium stringency conditions; 3 washes in 1X
SSC and 1% SDS at 60°C for 15 min each), a complete absence
of signal from all twelve clones resulted. These
hybridization results suggest that at least five of the
twelve clones isolated contain inserts. The results further
suggest that one would not isolate any hDinP clones by simply
screening a human library directly with a labeled fragment of
the E. coli DinP gene; the hybridization signal in such a
library screen would be indistinguishable from background.

Of the twelve clones isolated by the method of the
invention, the five which displayed the slightly stronger
signal under low stringency conditions relative to the other
seven were sequenced. These five clones were next used as query
sequences in individual BLASTX protein database searches
after translation into all six reading frames (FIGS. 5-9).
The single-letter amino acid codes appearing in the computer
analyses provided by the BLASTX searches (see FIGS. 5-9)
which correspond to the three-letter amino acid codes of the
Sequence Listing are set forth in Section 5.9 supra.






6.3. RESULTS 

This VGID℠ example isolated a total of five overlapping
human cDNA inserts (see FIG. 4 for a map of overlapping
regions) which appear in the Sequence Listing as SEQ ID
NOs:1-5. A BLASTX protein database search and computer
analysis were performed on each of the five identified
sequences after translation into all six reading frames (see
FIGS. 5-9 for BLASTX results). The results revealed high
amino acid sequence homology exclusively with the bacterial
DinP protein and its close relatives such as UV protection
protein mucB (see FIG. 5A). On the basis of overlapping
sequences, these five inserts were assigned to two separate,
homologous hDinP genes, as described below.

Three of the five overlapping inserts (SEQ ID NOs:1-3)
cover about half of the predicted length for a full-length
hDinP transcript after assembly into a composite sequence
(SEQ ID NO:6). The other two inserts (SEQ ID NOs:4-5), which
correspond to a cumulative length of 386 bases, also overlap
with each other and with the composite sequence of SEQ ID
NO:6. However, these two inserts provide evidence for the
existence of two hDinP genes, as further described below.
This result is in agreement with other characterized human
DNA repair genes, which are all known to be encoded by
multiple genes. That SEQ ID NOs:4-5 represent transcripts
derived from different genes encoding isoforms of hDinP is
suggested by limited internal sequence divergence at
positions 237-252 and 274-279.

6.4. DISCUSSION

The two novel hDinP genes identified above represent a
significant advance in our understanding of human genes 
involved in DNA repair. Moreover, the new genes will be 
useful in the development of various prognostic tests, 
diagnostic tests, and therapeutic interventions for treatment 
of disease, especially cancer. This is true, in part, 
because DNA repair pathways have been so strongly connected 
to cancer-causing mechanisms (see e.g. Fishel et al., 1993, 
Cell 75(5), 1027-1038). 

The protein sequences encoded by the five human clones
and their corresponding bacterial relatives are set forth in
SEQ ID NOs:7-70. The search analyses for the five clones
listed in SEQ ID NOs:1-5 are set forth in FIGS. 5-9,
respectively. It is noteworthy that all five independent
clones encode a protein homologous to E. coli DinP; i.e.,
mainly hDinP clones (five of twelve) were identified by the
VGID℠ method in this experiment. This result dramatically
demonstrates the high specificity for gene identification
obtainable with the VGID℠ method. This specificity is
directly correlated with the well defined input phenotypes
employed. Protein translation results for SEQ ID NO:1 (#1)
are listed in SEQ ID NOs:7-24. Protein translation results
for SEQ ID NO:2 (Tor-M) are listed in SEQ ID NOs:25-29.
Protein translation results for SEQ ID NO:3 (#3) are listed
in SEQ ID NOs:30-41. Protein translation results for SEQ ID
NO:4 (*1) are listed in SEQ ID NOs:42-58. Protein
translation results for SEQ ID NO:5 (*2) are listed in SEQ ID
NOs:59-70.



6.5. APPLICATION OF THE VGID℠ METHOD TO A COMPLEX,
MULTISTAGE SYSTEM - MULTIPLEX VGID℠ (VGGT℠)

The VGID℠ method may be applied to complex, multistage
systems, such as cancer. In cancer, several stages having
different phenotypes can coexist, e.g., within a single
biopsy specimen, or within primary cell cultures propagated
therefrom. When applied to such a system, however, the VGID℠
method requires performance of a large number of assays (i.e.
one unlabeled cDNA library and one biotin-labeled cDNA
library for pairwise comparison of each stage with all other
stages). Multiplex VGID℠, also known as ValiGene Gene
Trapping (VGGT℠), offers an alternative approach for analysis
of complex, multistage systems.

VGGT℠ does not require performance of a large number of
pairwise comparisons. Instead, the number of assays
performed is considerably reduced, thus saving the user time
and money. VGGT℠ (Multiplex VGID℠) permits the simultaneous
analysis of more than two phenotypes (represented by more
than two cDNA libraries) by PCR labeling inserts from each
library using primers having a unique label. In a preferred
embodiment, the unique label is a peptide label. In another
preferred embodiment, the unique peptide label is recognized
by an antibody specific for the label.

In this way, cDNA fragments derived from each of any
number of libraries subjected to a Multiplex VGID℠ analysis
can be specifically identified and retrieved as desired by
the user. Further, such identification and retrieval can be
performed at any point in the Multiplex VGID℠ analysis by
virtue of the unique library labels. For example, fragments
from any number of cDNA libraries can be mixed, denatured,
reannealed, and subjected to one or more rounds of MutS
chromatography and/or antibody affinity chromatography, all
without losing track of the phenotypic source (i.e. cDNA
library) from which any particular fragment-of-interest
originated.

In a typical Multiplex VGID℠ assay, reannealing occurs
in the presence of high complexity (i.e. labeled fragments
from more than two cDNA libraries are mixed, denatured and
reannealed). After reannealing, sorting of the various cDNAs
is carried out using MutS chromatography and/or antibody
affinity chromatography as desired by the user. The labeling
scheme permits the user to isolate any desired cDNA fragment
for cloning, probe use, or further sorting as desired.

In summary, peptide-labeled oligonucleotide PCR primers
are used for differential labeling of cDNA fragments
originating from different libraries. The labels allow one
to identify sequences common to more than two libraries and
to isolate sequences specific to each library through
chromatography. Further, the labels allow one to retrieve
cDNA fragments-of-interest from a mixture of nucleic acids of
different origins. Overall, the Multiplex VGID℠ process
allows one to differentially label, sort and isolate
expressed nucleotide sequences representing defined
phenotypes present in complex mixtures.
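
The bookkeeping made possible by unique library labels may be
sketched, purely by way of illustration, as follows. This Python
example is a toy model and not a description of the
chromatography: exact string identity stands in for the
structural identity of fragments, and the names Fragment,
retrieve, library_specific and shared_by_more_than_two are
hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    sequence: str  # cDNA insert sequence (toy strings in this sketch)
    label: str     # unique label carried by the primers used for one library


def retrieve(mixture, label):
    """Return the fragments carrying a given label (models label-specific retrieval)."""
    return [fragment for fragment in mixture if fragment.label == label]


def labels_by_sequence(mixture):
    """Map each sequence to the set of library labels under which it occurs."""
    found = defaultdict(set)
    for fragment in mixture:
        found[fragment.sequence].add(fragment.label)
    return dict(found)


def library_specific(mixture, label):
    """Sequences found only in the library identified by the given label."""
    by_seq = labels_by_sequence(mixture)
    return sorted(seq for seq, labels in by_seq.items() if labels == {label})


def shared_by_more_than_two(mixture):
    """Sequences common to more than two libraries."""
    by_seq = labels_by_sequence(mixture)
    return sorted(seq for seq, labels in by_seq.items() if len(labels) > 2)


if __name__ == "__main__":
    mixture = [
        Fragment("AAA", "L1"), Fragment("AAA", "L2"), Fragment("AAA", "L3"),
        Fragment("CCC", "L1"), Fragment("GGG", "L2"), Fragment("GGG", "L3"),
    ]
    print(shared_by_more_than_two(mixture))
    print(library_specific(mixture, "L1"))
```

Because every fragment retains its label after mixing, the
library of origin can be read off at any later step.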
6.5.1. EXAMPLE OF A COMPLEX SYSTEM - BREAST CANCER

One example of a complex system which may be subjected
to the Multiplex VGID℠ method of the invention is breast
cancer. In this example, congenic cell lines obtained from
the same individual (HBL 100, HH9, MCF-7 and MCF-7 ras) may
be used to represent four different cancer stages (i.e.
four different phenotypes), as follows:

(a) HBL 100 (pre-cancer stage);

(b) HH9 (pre-metastatic stage, hormone-sensitive);

(c) MCF-7 (metastatic stage, hormone-dependent); and

(d) MCF-7 ras (aggressively metastatic stage,
hormone-sensitive).

Analysis of this system by VGID℠, comparing two
phenotypes at a time, would require six different
experiments. That is, one would compare (a) with (b), (a)
with (c), (a) with (d), (b) with (c), (b) with (d), and (c)
with (d). Further, in order to gain knowledge of metabolic
pathways that might distinguish each stage, and to identify
key elements (such as optimal intervention points) within a
pathway-of-interest, cell growth of each cancer stage should
be analyzed under different conditions. For example, four
suitable conditions may be as follows:

(a) Standard growth conditions (e.g., cell culture
medium containing fetal calf serum);

(b) Non-steroid conditions (e.g. serum-free conditions;
or serum-containing conditions where the serum has
been treated to remove all steroids);

(c) Estradiol conditions (e.g. non-steroid conditions
where a defined concentration of estradiol has been
added back as the only steroid present); and

(d) Estradiol plus tamoxifen conditions (e.g. estradiol
conditions where a defined concentration of
tamoxifen has also been added).

For a complete analysis of this system using VGID℠,
where complete is defined as comparing four different
phenotypes to each other under four different conditions, one
would perform a total of twenty-four VGID℠ assays (six
pairwise comparisons under each of four conditions).
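
The count of twenty-four follows from enumerating every pair of
stages under every condition, as the following Python sketch
illustrates. The short condition names and the function
pairwise_assays are shorthand introduced for this example only.

```python
from itertools import combinations

# The four stages and four growth conditions of the example above.
STAGES = ["HBL 100", "HH9", "MCF-7", "MCF-7 ras"]
CONDITIONS = ["standard", "non-steroid", "estradiol", "estradiol plus tamoxifen"]


def pairwise_assays(stages, conditions):
    """List every (stage, stage, condition) comparison needed by a pairwise approach."""
    return [
        (first, second, condition)
        for condition in conditions
        for first, second in combinations(stages, 2)
    ]


if __name__ == "__main__":
    assays = pairwise_assays(STAGES, CONDITIONS)
    print(len(list(combinations(STAGES, 2))))  # pairs per condition: 6
    print(len(assays))                         # total pairwise assays: 24
```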

6.5.2. ANALYSIS OF A COMPLEX SYSTEM USING MULTIPLEX VGID℠

Each pairwise VGID℠ assay just described would provide a
readout of the expressed genes distinguishing the two samples
being analyzed. Twenty-four such pairwise comparisons could
then be further compared among themselves. The efficiency of
phenotype comparison in this way is thus hindered by the
large number of independent VGID℠ assays that need to be
performed.

An alternative approach is Multiplex VGID℠. In the
above example, HBL 100, HH9, MCF-7 and MCF-7 ras would each
be labeled with a different peptide label.

It will be understood that, in the Multiplex VGID℠
process, the labels serve not only to identify the
library-of-origin of a particular cDNA insert, but also to
provide a means of specific retrieval. Using this approach in the
above example, retrieval of cDNA fragments that characterize
each cancer stage relative to the others and relative to the
applied environmental conditions (i.e. cell culture growth
conditions) would require many fewer assays. For example,
four assays may be performed in which cDNA fragments from
each of the four cell lines are compared (i.e. mixed,
denatured, reannealed, and sorted by chromatography on MutS
and/or antibody columns) under each of the four growth
conditions set forth above. Thus, a full analysis of the
system may be performed in just four assays instead of the
twenty-four required under the pairwise analysis approach.
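
The saving grows with the number of phenotypes. The following
Python sketch compares the two assay counts under the assumption,
taken from the example above, that a multiplex analysis requires
one assay per growth condition regardless of the number of
phenotypes; the function names are hypothetical.

```python
from math import comb


def pairwise_assay_count(n_phenotypes: int, n_conditions: int) -> int:
    """Assays needed when every pair of phenotypes is compared under every condition."""
    return comb(n_phenotypes, 2) * n_conditions


def multiplex_assay_count(n_conditions: int) -> int:
    """Assays needed when all phenotypes are compared together, once per condition."""
    return n_conditions


if __name__ == "__main__":
    conditions = 4
    for phenotypes in range(3, 8):
        print(
            phenotypes,
            pairwise_assay_count(phenotypes, conditions),
            multiplex_assay_count(conditions),
        )
```

For four phenotypes and four conditions the functions return
twenty-four and four, respectively, matching the figures given
above.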

The invention described and claimed herein is not to be 
limited in scope by the specific embodiments herein disclosed 
since these embodiments are intended as illustration of 
several aspects of the invention. Any equivalent embodiments 
are intended to be within the scope of this invention. 
Indeed, various modifications of the invention in addition to 
those shown and described herein will become apparent to
those skilled in the art from the foregoing description. 
Such modifications are also intended to fall within the scope 
of the appended claims. Throughout this application various 
publications and patents are cited. Their contents are 
hereby incorporated by reference into the present application 
in their entireties. 
We claim:

1. A method for identifying one or more genes
underlying a defined phenotype comprising the following steps
in the order stated:

(a) removing mismatched duplex nucleic acid molecules
formed from hybridization within each of a
plurality of source populations of nucleic acids;
and

(b) retaining mismatched duplex nucleic acid molecules
formed from hybridization among the plurality of
source populations,

the retained molecules in step (b) comprising the one or more
genes underlying the defined phenotype.



2. The method of Claim 1, wherein the plurality of 
source populations comprises at least one normalized cDNA 
library.



3. The method of Claim 1, wherein the plurality of 
source populations comprises at least one linearized cDNA 
library. 

4. The method of Claim 1, wherein the plurality of 
source populations consists of DNA, the DNA of each of the 
source populations being labeled with a different label, and 
the hybridization in step (b) is carried out using an excess 
of labeled DNA from one or more source populations. 

5. The method of Claim 4, wherein the excess of 
labeled DNA is a three-fold excess. 



6. The method of Claim 1, wherein each of the source 
populations is derived from a cell line.

7. A method for identifying one or more genes
underlying a defined phenotype displayed by a cell or 
individual from which a first cDNA library is derived, but
not displayed by a cell or individual from which a plurality
of additional cDNA libraries is derived, comprising:

(a) hybridizing insert DNA from the first cDNA library
with itself;

(b) hybridizing insert DNA from each library of the
plurality of additional cDNA libraries with itself;

(c) contacting the DNA hybridized in step (a) with a
first immobilized mismatch binding protein;

(d) contacting each separate population of DNAs 
hybridized in step (b) individually with a second 
immobilized mismatch binding protein; 

(e) separating unbound DNA from bound DNA contacted in 
step (c) ; 

(f) separating unbound DNA from bound DNA contacted 
individually in step (d) ; 

(g) labeling each separate population of the unbound
DNA separated in step (f) with a distinguishable
label capable of binding a partner molecule
immobilized on a substrate;

(h) hybridizing DNA separately labeled in step (g) with 
unbound DNA separated in step (e) ; 

(i) contacting DNA hybridized in step (h) with a third 
immobilized mismatch binding protein; 

(j) separating unbound DNA from bound DNA contacted in 
step (i); 

(k) contacting unbound DNA separated in step (j) with 
the partner molecule of each different label; and 
(l) separating unbound DNA from bound DNA contacted in
step (k),

which unbound DNA separated in step (l) encodes one or more
identified genes underlying the defined phenotype.

8. The method of Claim 7, wherein one or more of the 
cDNA libraries is normalized. 




9. The method of Claim 7, wherein one or more of the
cDNA libraries is linearized. 



10. The method of Claim 7, wherein labeling is carried
out by polymerase chain reaction using a 5'-peptide labeled
primer.



11. The method of Claim 7, wherein at least one 
immobilized partner molecule is an antibody.

12. The method of Claim 11, wherein the antibody is an 
anti-peptide antibody. 

13. The method of Claim 7, wherein the hybridization in 
step (h) is carried out using an excess of labeled DNA. 

14. The method of Claim 13, wherein the excess of
labeled DNA is a three-fold excess.



15. The method of Claim 7, wherein the first, second, 
or third immobilized mismatch binding protein is MutS. 

16. The method of Claim 1, wherein the defined 
phenotype is selected from the group consisting
of a plant phenotype, a microorganism phenotype, and a
pathologic phenotype. 



17. The method of Claim 16, wherein the defined 
phenotype is a pathologic phenotype that is selected from the 
group consisting of cancer, osteoporosis, obesity, type II
diabetes, and a prion-related disease. 



18. A method for identifying one or more genes 
underlying a defined phenotype displayed by a cell or 
individual from which a first cDNA library is derived, but 
not displayed by a cell or individual from which a plurality 
of additional cDNA libraries is derived, comprising: 
(a) amplifying insert DNA from the first cDNA library
by polymerase chain reaction;

(b) amplifying insert DNA from each of the plurality of
additional cDNA libraries by polymerase
chain reaction;

(c) hybridizing DNA amplified in step (a) with itself;

(d) hybridizing each separate population of DNA 
amplified in step (b) with itself; 

(e) contacting DNA hybridized in step (c) with 
immobilized MutS; 

(f) contacting each separate population of DNA 
hybridized in step (d) individually with 
immobilized MutS; 

(g) separating unbound DNA from bound DNA contacted in 
step (e) ; 

(h) separating unbound DNA from bound DNA contacted in 
step (f ) ; 

(i) labeling unbound DNA separated in step (g) by 
polymerase chain reaction using unlabeled primers; 

(j) labeling each separate population of unbound DNA
separated in step (h) by polymerase chain reaction
using at least one primer having a distinguishable
5'-peptide-label capable of binding a partner
molecule immobilized on a substrate;

(k) hybridizing DNA labeled in step (i) with DNA 
labeled in step (j); 

(l) contacting DNA hybridized in step (k) with
immobilized MutS;

(m) separating unbound DNA from bound DNA contacted in
step (l);

(n) contacting unbound DNA separated in step (m) with
one or more partner molecules capable of binding
the distinguishable 5'-peptide-labeled primers; and

(o) separating unbound DNA from bound DNA contacted in
step (n),

which unbound DNA separated in step (o) encodes one or more
identified genes underlying the defined phenotype.
19. A method for identifying one or more alleles
underlying a defined phenotype displayed by a cell or
individual from which a first cDNA library is derived, but
not displayed by a cell or individual from which a plurality
of additional cDNA libraries is derived, comprising:

(a) hybridizing insert DNA from the first cDNA library
with itself;

(b) hybridizing insert DNA from each of the plurality
of additional cDNA libraries with itself;

(c) contacting DNA hybridized in step (a) with a first
immobilized mismatch binding protein;

(d) contacting each separate population of DNA 
hybridized in step (b) individually with a second 
immobilized mismatch binding protein; 

(e) separating unbound DNA from bound DNA contacted in 
step (c) ; 

(f) separating unbound DNA from bound DNA contacted in
step (d);

(g) labeling each separate population of unbound DNA 
separated in step (f) with a distinguishable label 
capable of binding a partner molecule immobilized 
on a substrate ; 

(h) hybridizing DNA labeled in step (g) with unbound 
DNA separated in step (e) ; 

(i) contacting DNA hybridized in step (h) with a third 
immobilized mismatch binding protein; 

(j) separating unbound DNA from bound DNA contacted in 
step (i); 

(k) releasing bound DNA separated in step (j) from the
immobilized mismatch binding protein;

(l) contacting DNA released in step (k) with one or
more partner molecules capable of binding the
distinct labels;

(m) denaturing DNA contacted in step (l); and

(n) separating unbound DNA from bound DNA denatured in
step (m), which unbound DNA separated in step (n)
encodes one or more identified alleles underlying
the defined phenotype.

20. The method of Claim 18 or Claim 19, wherein at 
least one cDNA library is normalized.

21. The method of Claim 18 or Claim 19, wherein at 
least one cDNA library is linearized. 

22. The method of Claim 19, wherein labeling is carried 
out by polymerase chain reaction using 5'-peptide labeled
primers. 



23. The method of Claim 19, wherein at least one 
immobilized partner molecule is an antibody. 

24. The method of Claim 23, wherein the antibody is an
anti-peptide antibody.

25. The method of Claim 19, wherein the hybridization 
in step (h) is carried out using an excess of labeled DNA. 

26. The method of Claim 19, wherein the excess of 
labeled DNA is a three-fold excess.



27. The method of Claim 19, wherein at least one of the 
immobilized mismatch binding proteins is MutS. 

28. A method for identifying one or more alleles
underlying a defined phenotype displayed by a cell or
individual from which a first cDNA library is derived, but
not displayed by a cell or individual from which a plurality
of additional cDNA libraries is derived, comprising:

(a) amplifying insert DNA from the first cDNA library 
by polymerase chain reaction; 

(b) amplifying insert DNA from each of the plurality of
additional cDNA libraries by polymerase chain 
reaction; 

(c) hybridizing DNA amplified in step (a) with itself; 

(d) hybridizing DNA amplified from each library in step
(b) with itself;

(e) contacting DNA hybridized in step (c) with
immobilized MutS;

(f) contacting each population of DNA hybridized in 
step (d) individually with immobilized MutS; 

(g) separating unbound DNA from bound DNA contacted in 
step (e) ; 

(h) separating unbound DNA from bound DNA contacted in 
step (f ) ; 

(i) amplifying unbound DNA separated in step (g) by 
polymerase chain reaction using unlabeled primers; 

(j) amplifying and labeling each population of unbound
DNA separated in step (h) by polymerase chain
reaction using a distinguishable 5'-peptide-labeled
primer;

(k) hybridizing DNA amplified and labeled in step (j) 

with DNA amplified in step (i) ; 
(l) contacting DNA hybridized in step (k) with
immobilized MutS;

(m) separating unbound DNA from bound DNA contacted in
step (l);

(n) releasing bound DNA separated in step (m) from 
immobilized MutS; 

(o) contacting DNA released in step (n) with one or
more immobilized antibodies specific for each
distinguishable 5'-peptide-labeled primer;

(p) denaturing DNA contacted in step (o); and

(q) separating unbound DNA from bound DNA denatured in 
step (p) , 

which unbound DNA separated in step (q) encodes one or more 
identified alleles underlying the defined phenotype. 
29. The method of Claim 28, wherein releasing bound DNA
from immobilized MutS in step (n) is carried out using ATP or 
proteinase K. 

30. The method of any one of Claims 1, 7, 18, 19 and 
28, which further comprises using the one or more genes or 
alleles identified to carry out a prognosis or a diagnosis. 

31. The method of claim 30, wherein the one or more 
genes or alleles identified, or an encoded protein thereof, 
is a target for drug intervention.

32. The method of claim 1, wherein the plurality of 
source populations is in the range of three to twelve source 
populations. 

33. The method of claim 1, wherein the plurality of 
source populations is in the range of three to six source 
populations. 



34. The method of claim 1, wherein the plurality of 
source populations consists of four source populations. 

35. A method for identifying one or more genes 
underlying a defined phenotype displayed by a cell or 
individual from which a first cDNA library is derived, but 
not displayed by a cell or individual from which a plurality 
of additional cDNA libraries is derived, comprising: 

(a) hybridizing insert DNA from each cDNA library with 
itself; 

(b) contacting each separate population of DNA 
hybridized in step (a) individually with a first 
immobilized mismatch binding protein;

(c) separating unbound DNA from bound DNA contacted 
individually in step (b) ; 

(d) labeling each separate population of unbound DNA 
separated in step (c) with a distinguishable label 
capable of binding a partner molecule immobilized
on a substrate;

(e) hybridizing DNA separately labeled in step (d) ; 

(f) contacting DNA hybridized in step (e) with a second 
immobilized mismatch binding protein; and

(g) separating unbound DNA from bound DNA contacted in
step (f).

36. A method for identifying one or more genes
underlying a defined phenotype displayed by a cell or 
individual from which a first cDNA library is derived, but 
not displayed by a cell or individual from which a plurality
of additional cDNA libraries is derived, comprising: 

(a) amplifying insert DNA from each cDNA library by
polymerase chain reaction; 

(b) hybridizing each separate population of DNA 
amplified in step (a) with itself;

(c) contacting each separate population of DNA 
hybridized in step (b) individually with 
immobilized MutS; 

(d) separating unbound DNA from bound DNA contacted in 
step (c) ; 

(e) labeling each separate population of unbound DNA 
separated in step (d) by polymerase chain reaction 
using at least one primer having a distinguishable 
5'-peptide-label capable of binding a partner
molecule immobilized on a substrate;

(f) hybridizing DNA labeled in step (e);

(g) contacting DNA hybridized in step (f) with
immobilized MutS; and

(h) separating unbound DNA from bound DNA contacted in
step (g).



37. A method for identifying one or more alleles 
underlying a defined phenotype displayed by a cell or 
individual from which a first cDNA library is derived, but 
not displayed by a cell or individual from which a plurality
of additional cDNA libraries is derived, comprising: 

(a) hybridizing insert DNA from each cDNA library with 
itself ; 

(b) contacting each separate population of DNA 
hybridized in step (a) individually with a first 
immobilized mismatch binding protein; 

(c) separating unbound DNA from bound DNA contacted in 
step (b) ; 

(d) labeling each separate population of unbound DNA 
separated in step (c) with a distinguishable label 
capable of binding a partner molecule immobilized 
on a substrate; 

(e) hybridizing DNA labeled in step (d) ; 

(f) contacting DNA hybridized in step (e) with a second 
immobilized mismatch binding protein; and 

(g) separating unbound DNA from bound DNA contacted in 
step (f ) . 



38. A method for identifying one or more alleles 
underlying a defined phenotype displayed by a cell or 
individual from which a first cDNA library is derived, but 
not displayed by a cell or individual from which a plurality 
of additional cDNA libraries is derived, comprising: 

(a) amplifying insert DNA from each cDNA library by 
polymerase chain reaction; 

(b) hybridizing DNA amplified from each library in step
(a) with itself;

(c) contacting DNA from each library hybridized in step
(b) individually with a first immobilized mismatch
binding protein;

(d) separating unbound DNA from bound DNA contacted in 
step (c) ; 

(e) amplifying and labeling each separate population of 
unbound DNA separated in step (d) by polymerase 
chain reaction using at least one primer having a 
distinguishable 5'-peptide-label;






(f ) hybridizing DNA amplified and labeled in step (e) ; 

(g) contacting DNA hybridized in step (f) with a second 
immobilized mismatch binding protein; 

(h) separating unbound DNA from bound DNA contacted in 
step (g) ; 

(i) releasing bound DNA separated in step (h) ; and 

(j) separating DNA released in step (i) into single
strands.



39. A method for identifying one or more alleles 
underlying a defined phenotype comprising the following steps 
in the order stated: 

(a) removing mismatched duplex nucleic acid molecules 
formed from hybridization within each of a 
plurality of source populations of nucleic acids; 

(b) retaining mismatched duplex nucleic acid molecules
formed from hybridization among the plurality of
source populations; and

(c) separating mismatched strands retained in step (b) , 
which separated strands comprise one or more 
alleles underlying the defined phenotype.

40. A method for identifying one or more genes
underlying a defined phenotype comprising the following steps
in the order stated: 

(a) removing mismatched duplex nucleic acid molecules
formed from hybridization within each of two source
populations of nucleic acids; and

(b) retaining mismatched duplex nucleic acid molecules
formed from hybridization between the two source
populations,

the retained molecules in step (b) comprising the one or more
genes underlying the defined phenotype.

41. A method for identifying one or more genes
underlying a defined phenotype comprising the following steps
in the order stated:

(a) removing mismatched duplex nucleic acid molecules
formed from hybridization within a first source
population of nucleic acids; and

(b) retaining mismatched duplex nucleic acid molecules
formed from hybridization between the first source
population and a second source population of
nucleic acids,

the retained molecules in step (b) comprising the one or more 
genes underlying the defined phenotype. 

42. A method for identifying one or more genes
underlying a defined phenotype displayed by a cell or 
individual from which a first cDNA library is derived, but 
not displayed by a cell or individual from which a second 
cDNA library is derived, comprising: 

(a) hybridizing insert DNA from the first cDNA library 
with itself; 

(b) hybridizing insert DNA from the second cDNA library
with itself; 

(c) contacting the DNA hybridized in step (a) with a 
first immobilized mismatch binding protein; 

(d) contacting the DNA hybridized in step (b) with a 
second immobilized mismatch binding protein; 

(e) separating unbound DNA from bound DNA contacted in 
step (c) ; 

(f) separating unbound DNA from bound DNA contacted in 
step (d) ; 

(g) labeling unbound DNA separated in step (f) with a 
label capable of binding a partner molecule 
immobilized on a substrate; 

(h) hybridizing DNA labeled in step (g) with unbound 
DNA separated in step (e) ; 

(i) contacting DNA hybridized in step (h) with a third 
immobilized mismatch binding protein; 

(j) separating unbound DNA from bound DNA contacted in 
step (i); 

(k) contacting unbound DNA separated in step (j) with
the partner molecule immobilized on the substrate
capable of binding the label; and

(l) separating unbound DNA from bound DNA contacted in
step (k),

which unbound DNA separated in step (l) encodes one or more
identified genes underlying the defined phenotype.

43. A method for identifying one or more genes 
underlying a defined phenotype from organisms having 
consanguinity comprising: 

(a) hybridizing insert DNA from a first collection of 
cDNA libraries derived from organisms having the 
defined phenotype with itself; 

(b) contacting the DNA hybridized in step (a) with a 
first immobilized mismatch binding protein; 

(c) separating unbound DNA from bound DNA contacted in 
step (b) ; 

(d) labeling unbound DNA separated in step (c) with a 
label capable of binding a partner molecule 
immobilized on a substrate; 

(e) hybridizing DNA labeled in step (d) with insert DNA 
from a second collection of cDNA libraries derived 
from organisms not having the defined phenotype;

(f) contacting DNA hybridized in step (e) with a second 
immobilized mismatch binding protein; 

(g) separating unbound DNA from bound DNA contacted in 
step (f ) ; 

(h) contacting unbound DNA separated in step (g) with
the partner molecule immobilized on the substrate 
capable of binding the label; and 

(i) separating unbound DNA from bound DNA contacted in 
step (h) , 

which unbound DNA separated in step (i) encodes one or more 
identified genes underlying the defined phenotype. 

44. A method for identifying one or more genes
underlying a defined phenotype displayed by a cell or 
individual from which a first cDNA library is derived, but 
not displayed by a cell or individual from which a second 
cDNA library is derived, comprising: 

(a) amplifying insert DNA from the first cDNA library 
by polymerase chain reaction; 

(b) amplifying insert DNA from the second cDNA library 
by polymerase chain reaction; 

(c) hybridizing DNA amplified in step (a) with itself; 

(d) hybridizing DNA amplified in step (b) with itself; 

(e) contacting DNA hybridized in step (c) with a first 
immobilized MutS; 

(f ) contacting DNA hybridized in step (d) with a second 
immobilized MutS; 

(g) separating unbound DNA from bound DNA contacted in 
step (e) ; 

(h) separating unbound DNA from bound DNA contacted in 
step (f); 

(i) amplifying unbound DNA separated in step (g) by 
polymerase chain reaction using unlabeled primers; 

(j) amplifying and labeling unbound DNA separated in
step (h) by polymerase chain reaction using
5'-biotinylated primers;

(k) hybridizing DNA amplified and labeled in step (j)
with DNA amplified in step (i);

(l) contacting DNA hybridized in step (k) with a third
immobilized MutS;

(m) separating unbound DNA from bound DNA contacted in
step (l);

(n) contacting unbound DNA separated in step (m) with
immobilized streptavidin; and

(o) separating unbound DNA from bound DNA contacted in
step (n),

which unbound DNA separated in step (o) encodes one or more 
identified genes underlying the defined phenotype. 

45. A method for identifying one or more genes
underlying a disease phenotype from healthy and affected 
individuals having consanguinity comprising: 

(a) amplifying insert DNA from a first collection of 
cDNA libraries derived from affected individuals by 
polymerase chain reaction; 

(b) hybridizing DNA amplified in step (a) with itself; 

(c) contacting DNA hybridized in step (b) with a first 
immobilized MutS; 

(d) separating unbound DNA from bound DNA contacted in 
step (c) ; 

(e) amplifying and labeling unbound DNA separated in 
step (d) by polymerase chain reaction using
5'-biotinylated primers;

(f ) amplifying insert DNA from a second collection of 
cDNA libraries derived from healthy individuals by 
polymerase chain reaction; 

(g) hybridizing DNA amplified and labeled in step (e) 
with DNA amplified in step (f ) ; 

(h) contacting DNA hybridized in step (g) with a second 
immobilized MutS; 

(i) separating unbound DNA from bound DNA contacted in 
step (h) ; 

(j) contacting unbound DNA separated in step (i) with
immobilized streptavidin; and

(k) separating unbound DNA from bound DNA contacted in
step (j),

which unbound DNA separated in step (k) encodes one or more
identified genes underlying the disease phenotype.

46. A method for identifying one or more alleles
underlying a defined phenotype displayed by a cell or 
individual from which a first cDNA library is derived, but 
not displayed by a cell or individual from which a second 
cDNA library is derived, comprising: 

(a) hybridizing insert DNA from the first cDNA library 
with itself; 

(b) hybridizing insert DNA from the second cDNA library
with itself; 

(c) contacting DNA hybridized in step (a) with a first 
immobilized mismatch binding protein; 

(d) contacting DNA hybridized in step (b) with a second 
immobilized mismatch binding protein;

(e) separating unbound DNA from bound DNA contacted in 
step (c) ; 

(f) separating unbound DNA from bound DNA contacted in 
step (d) ; 

(g) labeling unbound DNA separated in step (f) with a 
label capable of binding a partner molecule 
immobilized on a substrate; 

(h) hybridizing DNA labeled in step (g) with unbound 
DNA separated in step (e) ; 

(i) contacting DNA hybridized in step (h) with a third 
immobilized mismatch binding protein; 

(j) separating unbound DNA from bound DNA contacted in 
step (i) ; 

(k) releasing bound DNA separated in step (j) from the
third immobilized mismatch binding protein;

(l) contacting DNA released in step (k) with the
partner molecule immobilized on the substrate
capable of binding the label;

(m) denaturing DNA contacted in step (l); and

(n) separating unbound DNA from bound DNA denatured in
step (m),

which unbound DNA separated in step (n) encodes one or more 
identified alleles underlying the defined phenotype. 


47. A method for identifying one or more alleles 
underlying a defined phenotype from organisms having 
consanguinity comprising: 

(a) hybridizing insert DNA from a first collection of 
cDNA libraries derived from organisms having the 
defined phenotype with itself; 




(b) contacting DNA hybridized in step (a) with a first 
immobilized mismatch binding protein; 

(c) separating unbound DNA from bound DNA contacted in 
step (b) ; 

(d) labeling unbound DNA separated in step (c) with a 
label capable of binding a partner molecule 
immobilized on a substrate; 

(e) hybridizing DNA labeled in step (d) with insert DNA 
from a second collection of cDNA libraries derived 
from organisms not having the defined phenotype; 

(f) contacting DNA hybridized in step (e) with a second 
immobilized mismatch binding protein; 

(g) separating unbound DNA from bound DNA contacted in 
step (f ) ; 

(h) releasing bound DNA separated in step (g) from the 
second immobilized mismatch binding protein; 

(i) contacting DNA released in step (h) with the 
partner molecule immobilized on the substrate 
capable of binding the label; 

(j) denaturing DNA contacted in step (i); and 
(k) separating bound DNA from unbound DNA denatured in 
step (j), 

which bound DNA separated in step (k) encodes one or more 
identified alleles underlying the defined phenotype.

48. A method for identifying one or more alleles 
underlying a defined phenotype displayed by a cell or 
individual from which a first cDNA library is derived, but 
not displayed by a cell or individual from which a second 
cDNA library is derived, comprising: 

(a) amplifying insert DNA from the first cDNA library 
by polymerase chain reaction; 

(b) amplifying insert DNA from the second cDNA library 
by polymerase chain reaction; 

(c) hybridizing DNA amplified in step (a) with itself; 

(d) hybridizing DNA amplified in step (b) with itself; 

(e) contacting DNA hybridized in step (c) with a first
immobilized MutS;

(f) contacting DNA hybridized in step (d) with a second 
immobilized MutS; 

(g) separating unbound DNA from bound DNA contacted in 
step (e);

(h) separating unbound DNA from bound DNA contacted in 
step (f ) ; 

(i) amplifying unbound DNA separated in step (g) by 
polymerase chain reaction using unlabeled primers; 

(j) amplifying and labeling unbound DNA separated in
step (h) by polymerase chain reaction using
5'-biotinylated primers;

(k) hybridizing DNA amplified and labeled in step (j)
with DNA amplified in step (i);

(l) contacting DNA hybridized in step (k) with a third
immobilized MutS;

(m) separating unbound DNA from bound DNA contacted in
step (l);

(n) releasing bound DNA separated in step (m) from the
third immobilized MutS;

(o) contacting DNA released in step (n) with
immobilized streptavidin;

(p) denaturing DNA contacted in step (o); and

(q) separating unbound DNA from bound DNA denatured in
step (p),

which unbound DNA separated in step (q) encodes one or more
identified alleles underlying the defined phenotype.

49. A method for identifying one or more affected 
alleles underlying a disease phenotype from healthy and 
affected individuals having consanguinity comprising: 

(a) amplifying insert DNA from a first collection of 
cDNA libraries derived from affected individuals by 
polymerase chain reaction; 

(b) hybridizing DNA amplified in step (a) with itself; 

(c) contacting DNA hybridized in step (b) with a first
immobilized MutS;

(d) separating unbound DNA from bound DNA contacted in 
step (c) ; 

(e) amplifying and labeling unbound DNA separated in 
step (d) by polymerase chain reaction using
5'-biotinylated primers;

(f) amplifying insert DNA from a second collection of 
cDNA libraries derived from healthy individuals by 
polymerase chain reaction; 

(g) hybridizing DNA amplified and labeled in step (e) 
with DNA amplified in step (f);

(h) contacting DNA hybridized in step (g) with a second 
immobilized MutS; 

(i) separating unbound DNA from bound DNA contacted in 
step (h) ; 

(j) releasing bound DNA separated in step (i) from the
second immobilized MutS;

(k) contacting DNA released in step (j) with
immobilized streptavidin;

(l) denaturing DNA contacted in step (k); and

(m) separating bound DNA from unbound DNA denatured in
step (l),

which bound DNA separated in step (m) encodes one or more
identified affected alleles underlying the disease phenotype.

BLASTX

Query= § 3 

(248 letters) 

Translating both strands of query sequence in all 6 reading frames

Database: Non-redundant GenBank CDS
translations+PDB+SwissProt+SPupdate+PIR
204,173 sequences; 58,078,033 total letters.

                                                              Smallest
                                                                Sum
                                                Reading  High  Probability
Sequences producing High-scoring Segment Pairs:   Frame Score  P(N)      N

gi|984587 (D38582) DinP [Escherichia coli] ...       +3   128  3.2e-14   2
gi|1303959 (D84432) YqjH [Bacillus subtilis]         +3   121  1.9e-12   2
pir||H64239 UV protection protein mucB homolo...     +3    88  6.5e-05   1

gi|984587 (D38582) DinP [Escherichia coli] gi|1208981 (DS83536) unknown
[Escherichia coli]
Length = 351

Plus Strand HSPs: 

Score = 128 (58.3 bits), Expect = 3.2e-14, Sum P(2) = 3.2e-14
Identities = 28/69 (40%), Positives = 40/69 (51%), Frame = +3

Query: 39 SGAXLAAQLRHDIYKOXRLTSSVGVSYNKLLAKLGSXFNKPNGVTVITXENRLXFLXHXP 218 

S +A++R 1+ + +LT+SGV+ KLAK+S NKPNG VI T FL P 
Sbjct: 118 SATLIAQEIROTIFNELQLTASAGVAPVKFLAKIASDMNKPNGQFVITPAEVPAFLQTLP 177 

Query: 219 IGEFRGVGE 245 (SEQ ID NO: 30) 
+ + GVG+ 

Sbjct: 178 LAKIPGVGK 186 (SEQ ID NO: 31) 

Score = 48 (21.8 bits), Expect = 3.2e-14, Sum P(2) = 3.2e-14
Identities = 9/10 (90%), Positives = 10/10 (100%), Frame = +3

Query: 3 DEAYLDVTDN 32 (SEQ ID NO: 33) 

DEAYLDVTD+ (SEQ ID NO: 35) 

Sbjct: 103 DEAYLDVTOS 112 (SEQ ID NO: 34) 

gi|1303959 (D84432) YqjH [Bacillus subtilis]
Length = 414 

FIG.7A 
Plus Strand HSPs:

Score = 121 (55.1 bits), Expect = 1.9e-12, Sum P(2) = 1.9e-12
Identities = 24/65 (36%), Positives = 38/65 (58%), Frame = +3

Query: 54 AAQLRHDIYKOXRLTSSVGVSYNKLLAKLGSXFNKPNGVTVITXENRLXFLXHXPIGEFR 233 

A +++ + K+ L SS-fG-H- NK LAK+ S KP G+T-H- L P-KJE 

Sbjcl: 127 AKEIQSRLQKELLLPSSIGIAPNKFLAKMASDMKKPLGITILRKROVPDILWPLPVGEUH 186 

Query: 234 GVGEK 248 (SEQ ID NO: 36) 
GVG+K 

Sbjct: 187 GVGKK 191 (SEQ ID NO: 37^ 

Score = 43 (19.6 bits), Expect = 1.9e-12, Sum P(2) = 1.9e-12
Identities = 8/17 (47%), Positives = 10/17 (58%), Frame = +3

Query:   3 DEAYLDVTDNALSGAXL  53 (SEQ ID NO: 38)
           DE Y+D+TD   S   L
Sbjct: 108 DEGYMDMTDTPYSSRAL 124 (SEQ ID NO: 39)

pir||H64239 UV protection protein mucB homolog - Mycoplasma genitalium
(SGC3) gi|1046068 (U39720) UV protection protein [Mycoplasma
genitalium]
Length = 411

Plus Strand HSPs:

Score = 88 (40.1 bits), Expect = 6.5e-05, P = 6.5e-05
Identities = 20/66 (30%), Positives = 39/66 (59%), Frame = +3

Query:  51 LAAQLRHDIYKQXRLTSSVGVSYNKLLAKLGSXFNKPNGVTVITXENRLXFLXHXPIGEF 230
           +A ++++ +++  R+  S+G+S + L+AK+ S   KP G+   + ++    L   PI E
Sbjct: 136 IAKKIKNFVFQNLRIKISIGISDHFLIAKIFSNQAKPFGIKSCSVKDIKKKLWPLPITEI 195

Query: 231 RGVGEK 248 (SEQ ID NO: 40)
            G+GEK
Sbjct: 196 PGIGEK 201 (SEQ ID NO: 41)

sp|P18642|IMPB_SALTY IMPB PROTEIN. pir||JQ0661 impB protein -
Salmonella typhimurium plasmid TP110 gi|47748 (X53528) impB gene
product (AA 1-424) [Salmonella typhimurium]
Length = 424

Plus Strand HSPs:

Score = 49 (22.3 bits), Expect = 0.00024, Sum P(3) = 0.00024
Identities = 12/37 (32%), Positives = 18/37 (48%), Frame = +3

FIG.7B 
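
For illustration only, the "Identities" and "Positives" figures reported for each high-scoring segment pair can be recomputed from the match line printed between the Query and Sbjct rows: a letter marks an identical residue and a "+" marks a conservative substitution. The sketch below is not part of the original BLASTX report. The function names are hypothetical, and the assumption that percentages are truncated rather than rounded is inferred only from the printed values (for example, 24/65 is shown as 36%).

```python
def hsp_counts(midline):
    """Return (identities, positives) for one BLAST match line.

    Letters mark identical residues; '+' marks a positive-scoring
    substitution; spaces mark non-positive pairs.
    """
    identities = sum(1 for ch in midline if ch.isalpha())
    positives = identities + midline.count("+")
    return identities, positives


def truncated_percent(part, whole):
    """Integer percentage, truncated (assumed convention, see lead-in)."""
    return (100 * part) // whole


# Match line of the short HSP against DinP shown in FIG. 7A.
ident, pos = hsp_counts("DEAYLDVTD+")
print(ident, pos, truncated_percent(ident, 10), truncated_percent(pos, 10))
```

Applied to the ten-residue segment of FIG. 7A, the sketch counts 9 identities and 10 positives, in agreement with the printed "9/10 (90%)" and "10/10 (100%)".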
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BLASTX VI 2 

Query= ♦! 

(388 letters) 

Translating both strands of query sequence in all 6 reading frames

Database: Non-redundant GenBank CDS
translations+PDB+SwissProt+SPupdate+PIR
258,816 sequences; 73,256,548 total letters.
Searching done

                                                         Smallest
                                                           Sum
                                                 Reading  High  Probability
Sequences producing High-scoring Segment Pairs:    Frame Score  P(N)      N

gi|984587 (D38582) DinP [Escherichia coli] ...              +3  115  9.8e-11  2
sp|P54545|YQJH_BACSU HYPOTHETICAL 47.0 KD PROTEIN IN G...   +3  108  4.5e-08  2
gi|1706953 (U52110) Dbh [Sulfolobus solfatar...             +3   99  6.0e-06  2
sp|P54560|YQJOACSU HYPOTHETICAL 45.9 KD PROTEIN IN G...     +3  101  4.3e-05  1
sp|P34409|YLW6_CAEEL HYPOTHETICAL 59.1 KD PROTEIN F22B...   +3   82  0.0014   2
gnl|PID|8290932 (Z83866) unknown [Mycobacterium I. ..       +3   84  0.0043   2

gi|984587 (D38582) DinP [Escherichia coli] gnl|PID|d1012651 (D83535)
unknown [Escherichia coli] gi|1552799 (U70214) DinP [Escherichia
coli] gi|1786425 (AE000131) hypothetical protein DinP [Escherichia
coli]
Length = 351



Plus Strand HSPs: 

Score = 115 (52.9 bits), Expect = 2.9e-10, Sum P(2) = 2.9e-10
Identities = 26/61 (42%), Positives = 34/61 (55%), Frame = +3

Query:   3 QLGIHSAMRSAEARRLAPDGIFLTPDFAKYKAISKQIHAVFRTITPKIEAVALDEAYLDV 182
           + G+ SAM +  A +L P    L   F  YK  S  I  +F   T +IE ++LDEAYLDV
Sbjct:  50 KFGVRSAMPTGMALKLCPHLTLLPGRFDAYKEASNHIREIFSRYTSRIEPLSLDEAYLDV 109

Query: 183 T 185 (SEQ ID NO: 42)
           T
Sbjct: 110 T 110 (SEQ ID NO: 43)

Score = 47 (21.6 bits), Expect = 9.8e-11, Sum P(2) = 9.8e-11
Identities = 12/41 (29%), Positives = 20/41 (48%), Frame = +2

Query: 218 QLRHDIYIHTRLL*FGGCIVYHTISEVGI*FNKPNGVTVIT 340 (SEQ ID NO: 45)
           ++R  I+   +L    G      ++++    NKPNG  VIT
Sbjct: 125 EIRQTIFNELQLTASAGVAPVKFLAKIASDMNKPNGQFVIT 165 (SEQ ID NO: 46)

Score = 47 (21.6 bits), Expect = 2.9e-10, Sum P(2) = 2.9e-10

FIG.8A 
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Identities = 12/38 (31%), Positives = 18/38 (47%), Frame = +3

Query: 198 SGALLAHSYGMTFIYTHDYSSSVGVSYTILLAKLGSDL 311 (SEQ ID NO: 48)
           S  L+A     T       ++S GV+    LAK+ SD+
Sbjct: 118 SATLIAQEIRQTIFNELQLTASAGVAPVKFLAKIASDM 155 (SEQ ID NO: 49)

sp|P54545|YQJH_BACSU HYPOTHETICAL 47.0 KD PROTEIN IN GLNQ-ANSR
INTERGENIC REGION gnl|PID|d1013294 (D84432) YqjH [Bacillus subtilis]
Length = 414

Plus Strand HSPs:

Score = 108 (49.7 bits), Expect = 4.5e-08, Sum P(2) = 4.5e-08
Identities = 22/68 (32%), Positives = 38/68 (55%), Frame = +3

Query:   9 GIHSAMRSAEARRLAPDGIFLTPDFAKYKAISKQIHAVFRTITPKIEAVALDEAYLDVTA 188
           G+ + M   +A+R  P+ I L P+F +Y+  S+ +  + R  T  +E V++DE Y+D+T
Sbjct:  57 GVKTTMPVWQAKRHCPELIVLPPNFDRYRNSSRAMFTILREYTDLVEPVSIDEGYMDMTD 116

Query: 189 NALSGALL 212 (SEQ ID NO: 50)
              S   L
Sbjct: 117 TPYSSRAL 124 (SEQ ID NO: 51)

Score = 42 (19.3 bits), Expect = 4.5e-08, Sum P(2) = 4.5e-08
Identities = 8/18 (44%), Positives = 13/18 (72%), Frame = +3

Query: 258 SSVGVSYTILLAKLGSDL 311 (SEQ ID NO: 52)
           SS+G++    LAK+ SD+
Sbjct: 142 SSIGIAPNKFLAKMASDM 159 (SEQ ID NO: 53)

gi|1706953 (U52110) Dbh [Sulfolobus solfataricus]
Length = 354

Plus Strand HSPs:

Score = 99 (45.5 bits), Expect = 6.0e-06, Sum P(2) = 6.0e-06
Identities = 22/61 (36%), Positives = 35/61 (57%), Frame = +3

Query:   3 QLGIHSAMRSAEARRLAPDGIFLTPDFAKYKAISKQIHAVFRTITPKIEAVALDEAYLDV 182
           +LG+ + M   +A ++AP  I++      Y+A S +I  +      KIE  ++DEAYLDV
Sbjct:  52 KLGVKAGMPIIKAMQIAPSAIYVPMRKPIYEAFSNRIMNLLNKHADKIEVASIDEAYLDV 111

Query: 183 T 185 (SEQ ID NO: 54)
           T
Sbjct: 112 T 112 (SEQ ID NO: 55)

Score = 37 (17.0 bits), Expect = 6.0e-06, Sum P(2) = 6.0e-06
Identities = 12/43 (27%), Positives = 20/43 (46%), Frame = +3

Query: 180 VTANALSGALLAHSYGMTFIYTHDYSSSVGVSYTILLAKLGSD 308 (SEQ ID NO: 57)
           V  N  +G  LA       +     + +VGV+   +LAK+ +D
Sbjct: 115 VEGNFENGIELARKIKQEILEKEKITVTVGVAPNKILAKIIAD 157 (SEQ ID NO: 58)

FIG.8B 
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BLASTX ' 

Query= *2 

(383 letters) 

Translating both strands of query sequence in all 6 reading frames 

Database: Non-redundant GenBank CDS
translations+PDB+SwissProt+SPupdate+PIR
258,816 sequences; 73,255,548 total letters.
Searching done

                                                         Smallest
                                                           Sum
                                                 Reading  High  Probability
Sequences producing High-scoring Segment Pairs:    Frame Score  P(N)      N

sp|P54545|YQJH_BACSU HYPOTHETICAL 47.0 KD PROTEIN IN G...   +1   83  8.5e-07  2
gi|984587 (D38582) DinP [Escherichia coli] ...              +1   82  1.7e-05  2

sp|P54545|YQJH_BACSU HYPOTHETICAL 47.0 KD PROTEIN IN GLNQ-ANSR
INTERGENIC REGION gnl|PID|d1013294 (D84432) YqjH [Bacillus subtilis]
Length = 414

Plus Strand HSPs:

Score = 83 (38.2 bits), Expect = 8.6e-07, Sum P(2) = 8.6e-07
Identities = 16/41 (39%), Positives = 25/41 (60%), Frame = +1

Query:  61 IFLTPDFAKYKAISKQIHAVFRTITPKIEAVVIDEAYLDVT 183 (SEQ ID NO: 69)
           I L P+F +Y+  S+ +  + R  T  +E V IDE Y+D+T
Sbjct:  75 IVLPPNFDRYRNSSRAMFTILREYTDLVEPVSIDEGYMDMT 115 (SEQ ID NO: 70)

Score = 59 (27.1 bits), Expect = 8.6e-07, Sum P(2) = 8.6e-07
Identities = 11/20 (55%), Positives = 16/20 (80%), Frame = +1

Query: 247 LTSSVGVSYNKLLAKLGSDL 306 (SEQ ID NO: 59)
           L SS+G++ NK LAK+ SD+
Sbjct: 140 LPSSIGIAPNKFLAKMASDM 159 (SEQ ID NO: 60)

gi|984587 (D38582) DinP [Escherichia coli] gnl|PID|d1012651 (D83535)
unknown [Escherichia coli] gi|1552799 (U70214) DinP [Escherichia
coli] gi|1786425 (AE000131) hypothetical protein DinP [Escherichia
coli]
Length = 351

Plus Strand HSPs:

Score = 82 (37.7 bits), Expect = 1.7e-05, Sum P(2) = 1.7e-05

FIG.9A 
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Identities = 17/35 (48%), Positives = 21/35 (60%), Frame = +1

Query:  79 FAKYKAISKQIHAVFRTITPKIEAVVIDEAYLDVT 183 (SEQ ID NO: 61)
           F  YK  S  I  +F   T +IE + +DEAYLDVT
Sbjct:  76 FDAYKEASNHIREIFSRYTSRIEPLSLDEAYLDVT 110 (SEQ ID NO: 62)

Score = 51 (23.5 bits), Expect = 1.7e-05, Sum P(2) = 1.7e-05
Identities = 11/20 (55%), Positives = 15/20 (75%), Frame = +1

Query: 247 LTSSVGVSYNKLLAKLGSDL 306 (SEQ ID NO: 64)
           LT+S GV+  K LAK+ SD+
Sbjct: 136 LTASAGVAPVKFLAKIASDM 155 (SEQ ID NO: 65)

Score = 40 (18.4 bits), Expect = 0.00079, Sum P(2) = 0.00079
Identities = 8/18 (44%), Positives = 12/18 (66%), Frame = +3

Query: 282 ISEVGI*FNKPNGVTVIT 335 (SEQ ID NO: 66)
           ++++    NKPNG  VIT
Sbjct: 148 LAKIASDMNKPNGQFVIT 165 (SEQ ID NO: 67)



FIG.9B 
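
For illustration only, the statement in FIGS. 7A, 8A and 9A that the program translates "both strands of query sequence in all 6 reading frames" can be sketched as follows. The sketch is not the BLASTX implementation and is not part of the original disclosure. The function names are hypothetical, the standard genetic code is assumed, and any codon containing an ambiguous base (such as "n") is rendered as "X", mirroring the "X" residues seen in the query lines.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one frame; codons with ambiguous bases become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the translations for frames +1..+3 and -1..-3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


# First 32 nucleotides of SEQ ID NO: 3 as given in the Sequence Listing.
print(six_frames("gggatgaggcttacttagatgtgaccgacaat")["+3"])
```

With the first 32 nucleotides of SEQ ID NO: 3 as input, frame +3 yields DEAYLDVTDN, the query segment at positions 3 to 32 in FIG. 7A.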
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SEQUENCE LISTING 



<110> Iris, Francois J.-M.
      Pourny, Jean-Louis

<120> MULTIPLEX VGID

<130> 9408-024

<140> To be assigned
<141> 1999-01-15

<150> 09/007,905
<151> 1998-01-15

<160> 81

<170> PatentIn Ver. 2.0

<210> 1
<211> 329
<212> DNA
<213> Escherichia coli

<400> 1

gggccaccgc ttcaattttt ggcgtaattg tccgaaaaac ggcatgaatt tgcttagaaa 60 

tggctttata tttggcaaaa ccaggggtca gaaagatccc gtctggcgct aaacgccgcg 120 

cttctgcgga tcgcatggcc gaatgaatgc caagttggcg cgcgacatag ttcgccgtcg 180 

tcaccacccc acgaccccca gtttctgctg gatcacgcga aataattaat ggcrggtggc 240 

gtaatgccgg attgtcacgc atctcgactt gggcatagaa ggcatcgata tcaacatggn 300 
ggaattttac gtgtatcagt tgtcaataa 329

<210> 2 
<211> 256 
<212> DNA 

<213> Escherichia coli 
<400> 2 

ccgggcgttt aggcagacgg gatctttctg acccctgatt ttgccaaata taaagccatt 60 
tctaagcaaa ttcatgccgt ttttcggaca atta-gccaa aaactgagcc ggtggtgatt 120 
gatgaggctt acttagatgt gaccgccaat gcgttgtcag gcgcactgct ggccgcacag 180 
ttacggcatg acatttatat acacacacga ttactctagt tcggtgggtg tatcgtatac 240 
catactatta gcgatg 256 

<210> 3 
<211> 248 
<212> DNA 

<213> Escherichia coli 
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<400> 3 

gggatgaggc ttacttagat gtgaccgaca atgcgttgtc aggcgcaatn ctggccgcac 60
agttacggca cgacatttat aaacaancac gnttaactag ttcggtgggt gtatcgtata 120
acaaactatt agcgaagttg ggatctgant ttaataagcc aaacggtgtg acggtgatta 180
cgncggaaaa ccgcctggnt tttttagntc atttnccgat tggtgaattt cgcggggtcg 240
gtgagaaa 248



<210> 4 

<211> 387 

<212> DNA 

<213> Escherichia coli 



<400> 4 

gccaacttgg cattcattcg gccatgcgat ccgcagaagc gcggcgttta gcgccagacg 60
ggatctttct gacccctgat tttgccaaat ataaagccat ttctaagcaa attcatgccg 120
tttttcggac aattacgcca aaaattgaag cggtggccct tgatgaggct tacttagatg 180
tgaccgccaa tgcgttgtca ggcgcactgc tggccgcaca gttacggcat gacatttata 240
tacacacacg attactctag ttcggtgggt gtatcgtata ccatactatt agcgaggttg 300
ggatctgatt taataagcca aacggtgtga cggtgattac gcggaaaacc gcctggtttt 360
ttagtcattt ccgattggtg aatttcg 387



<210> 5 
<211> 381 
<212> DNA 

<213> Escherichia coli 



<400> 5 

gccaacttgg cattcattcg gccatgcgat ccgaagcgcg gcgtttaggc agcagacggg 60 

atctttctga cccctgattt tgccaaatat aaagccattt ctaagcaaat tcatgccgtt 120 

tttcggacaa ttacgccaaa aattgaagcg gtggtgattg atgaggctta cttagatgtg 180 

accgccaatg cgttgtcagg cgcaatctgg ccgcacagtt acggcatgac atttataaac 240 

aacacgttaa ctagttcggt gggtgtatcg tataacaaac tattagcgaa gttgggatct 300 

gatttaataa gccaaacggt gtgacggtga ttacgcggaa aaccgcctgg ttttttagtc 360 
atttccgatt ggtgaatttc g 381



<210> 6 
<211> 567 
<212> DNA 

<213> Escherichia coli 



<400> 6 

ttattgacaa ctgatacacg taaaattccc catgttgata tcgatgcctt ctatgcccaa 60
gtcgagatgc gtgacaatcc ggcattacgc caccagccat taattatttc gcgtgatcca 120
gcagaaactg ggggtcgtgg ggtggtgacg acggcgaact atgtcgcgcc aacttggcat 180
tcattcggcc atgcgatccg cagaagcgcc gggcgtttag gcagacggga tctttctgac 240
ccctgatttt gccaaatata aagccatttc taagcaaatt catgccgttt ttcggacaat 300
tacgccaaaa attgagccgg tggtgattga tgaggcttac ttagatgtga ccgccaatgc 360
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gttgtcaggc gcactgctgg ccgcacagtt acggcatgac atttataaac aacacgctaa 420 
ctagttcggt gggtgtatcg tataacaaac tattagcgaa gttgggatct gatttaataa 480
gccaaacggt gtgacggtga ttacgcggaa aaccgcctgg ttttttagtc atttccgatt 540
ggtgaatttc gcggggtcgg tgagaaa 567

<210> 7 
<211> 63 
<212> PRT 

<213> Escherichia coli 
<400> 7

Arg Gly Val Val Thr Thr Ala Asn Tyr Val Ala Arg Leu Gly Ile His
1 5 10 15

Ser Ala Met Arg Ser Ala Glu Ala Arg Arg Leu Ala Pro Asp Gly Ile
20 25 30

Phe Leu Thr Pro Asp Phe Ala Lys Tyr Lys Ala Ile Ser Lys Gln Ile
35 40 45

His Ala Val Phe Arg Thr Ile Thr Pro Lys Ile Glu Ala Val Ala
50 55 60



<210> 8 
<211> 64 
<212> PRT 

<213> Escherichia coli 
<400> 8 

Arg Gly Val Ile Ser Thr Ala Asn Tyr Pro Ala Arg Lys Phe Gly Val
1 5 10 15

Arg Ser Ala Met Pro Thr Gly Met Ala Leu Lys Leu Cys Pro His Leu
20 25 30

Thr Leu Leu Pro Gly Arg Phe Asp Ala Tyr Lys Glu Ala Ser Asn His
35 40 45

Ile Arg Glu Ile Phe Ser Arg Tyr Thr Ser Arg Ile Glu Pro Leu Ser
50 55 60



<210> 9 
<211> 4 
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<212> PRT 

<213> Escherichia coli 
<400> 9 

Thr Ala Asn Tyr 
1 



<210> 10 
<211> 29 
<212> PRT 

<213> Escherichia coli 



<400> 10 

Lys Phe Xaa His Val Asp Ile Asp Ala Phe Tyr Ala Gln Val Glu Met
1 5 10 15

Arg Asp Asn Pro Ala Leu Arg His Gln Pro Leu Ile Ile
20 25



<210> 11 
<211> 29 
<212> PRT 

<213> Escherichia coli 
<400> 11 

Lys Ile Ile His Val Asp Met Asp Cys Phe Phe Ala Ala Val Glu Met
1 5 10 15

Arg Asp Asn Pro Ala Leu Arg Asp Ile Pro Ile Ala Ile
20 25



<210> 12 
<211> 10 
<212> PRT 

<213> Escherichia coli 
<400> 12 

Val Glu Met Arg Asp Asn Pro Ala Leu Arg 
1 5 10



<210> 13 

<211> 55 

<212> PRT 

<213> Escherichia coli 
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<400> 13 

Arg Gly Val Val Thr Thr Ala Asn Tyr Val Ala Arg Gln Leu Gly Ile
1 5 10 15

His Ser Ala Met Arg Ser Ala Glu Ala Arg Arg Leu Ala Pro Asp Gly
20 25 30

Ile Phe Leu Thr Pro Asp Phe Ala Lys Tyr Lys Ala Ile Ser Lys Gln
35 40 45

Ile His Ala Val Phe Arg Thr
50 55



<210> 14 
<211> 55
<212> PRT

<213> Escherichia coli
<400> 14

Arg Ser Val Val Ser Thr Cys Asn Tyr Val Ala Arg Ser Tyr Gly Ile
1 5 10 15

Arg Ser Gly Met Ser Ile Leu Lys Ala Leu Glu Leu Cys Pro Asn Ala
20 25 30

Ile Phe Ala His Ser Asn Phe Arg Asn Tyr Arg Lys His Ser Lys Arg
35 40 45

Ile Phe Ser Val Ile Glu Ser
50 55



<210> 15 
<211> 5 
<212> PRT 

<213> Escherichia coli 
<400> 15

Asn Tyr Val Ala Arg 
1 5 



<210> 16 
<211> 28 
<212> PRT 

<213> Escherichia coli 
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<400> 16 

Phe Xaa His Val Asp Ile Asp Ala Phe Tyr Ala Gln Val Glu Met Arg
1 5 10 15

Asp Asn Pro Ala Leu Arg His Gln Pro Leu Ile Ile
20 25



<210> 17 
<211> 28 
<212> PRT 

<213> Escherichia coli 
<400> 17 

Phe Leu Tyr Phe Asp Phe Asp Ala Phe Phe Ala Ser Val Glu Glu Leu 
1 5 10 15

Glu Asn Pro Glu Leu Val Asn Gln Pro Leu Ile Val
20 25 



<210> 18 
<211> 4 
<212> PRT 

<213> Escherichia coli 

<400> 18 
Gln Pro Leu Ile
1 



<210> 19 
<211> 34 
<212> PRT 

<213> Escherichia coli 
<400> 19 

Val Asp Ile Asp Ala Phe Tyr Ala Gln Val Glu Met Arg Asp Asn Pro
1 5 10 15

Ala Leu Arg His Gln Pro Leu Ile Ile Ser Arg Asp Pro Ala Glu Thr
20 25 30

Gly Gly
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<210> 20 
<211> 34 
<212> PRT 

<213> Escherichia coli 
<400> 20 

Val Asp Met Gln Ser Phe Tyr Ala Ser Val Glu Lys Ala Glu Asn Pro
1 5 10 15

His Leu Lys Asn Arg Pro Val Ile Val Ser Gly Asp Pro Glu Lys Arg
20 25 30

Gly Gly



<210> 21 
<211> 59 
<212> PRT 

<213> Escherichia coli 
<400> 21 

Gly Val Val Thr Thr Ala Asn Tyr Val Ala Arg Gln Leu Gly Ile His
1 5 10 15

Ser Ala Met Arg Ser Ala Glu Ala Arg Arg Leu Ala Pro Asp Gly Ile
20 25 30

Phe Leu Thr Pro Phe Ala Lys Tyr Lys Ala Ile Ser Lys Gln Ile His
35 40 45

Ala Val Phe Arg Thr Ile Thr Pro Lys Ile Glu
50 55



<210> 22 
<211> 60 
<212> PRT 

<213> Escherichia coli 
<400> 22

Gly Val Val Leu Ala Ala Cys Pro Leu Ala Lys Gln Lys Gly Val Val
1 5 10 15

Asn Ala Ser Arg Leu Trp Glu Ala Gln Glu Lys Cys Pro Glu Ala Val
20 25 30

Val Leu Arg Pro Arg Met Gln Arg Tyr Ile Asp Val Ser Leu Gln Ile
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35 40 45 

Thr Ala Ile Leu Glu Glu Tyr Thr Asp Leu Val Glu
50 55 60 



<210> 23 
<211> 60 
<212> PRT 

<213> Escherichia coli 
<400> 23 

Arg Gly Val Val Thr Thr Ala Asn Tyr Val Ala Arg Gln Leu Gly Ile
1 5 10 15

His Ser Ala Met Arg Ser Ala Glu Ala Arg Arg Leu Ala Pro Asp Gly
20 25 30

Ile Phe Leu Thr Pro Asp Phe Ala Lys Tyr Lys Ala Ile Ser Lys Gln
35 40 45

Ile His Ala Val Phe Arg Thr Ile Thr Pro Lys Ile
50 55 60



<210> 24 
<211> 60 
<212> PRT 

<213> Escherichia coli
<400> 24

Lys Gly Ile Val Val Thr Cys Ser Tyr Glu Ala Arg Ala Arg Gly Val
1 5 10 15

Lys Thr Thr Met Pro Val Trp Gln Ala Lys Arg His Cys Pro Glu Leu
20 25 30

Ile Val Leu Pro Pro Asn Phe Asp Arg Tyr Arg Asn Ser Ser Arg Ala
35 40 45

Met Phe Thr Ile Leu Arg Glu Tyr Thr Asp Leu Val
50 55 60



<210> 25 
<211> 35 
<212> PRT

<213> Escherichia coli
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i 



<400> 25 

Ile Phe Leu Tyr Lys Ala Ile Ser Lys Gln Ile His Ala Val Phe Arg
1 5 10 15

Thr Ile Thr Pro Lys Ile Glu Pro Val Val Ile Asp Glu Ala Tyr Leu
20 25 30

Asp Val Thr
35



<210> 26 
<211> 50 
<212> PRT 

<213> Escherichia coli 
<400> 26 

Ile Val Leu Pro Pro Asn Phe Asp Arg Tyr Arg Asn Ser Ser Arg Ala
1 5 10 15

Met Phe Thr Ile Leu Arg Glu Tyr Thr Asp Leu Val Glu Pro Val Ser
20 25 30

Ile Asp Glu Gly Tyr Met Asp Met Thr Asp Thr Pro Tyr Ser Ser Arg
35 40 45

Ala Leu
50



<210> 27 
<211> 35 
<212> PRT 

<213> Escherichia coli 
<400> 27 

Phe Ala Lys Tyr Lys Ala Ile Ser Lys Gln Ile His Ala Val Phe Arg
1 5 10 15

Thr Ile Thr Pro Lys Ile Glu Pro Val Val Ile Asp Glu Ala Tyr Leu
20 25 30

Asp Val Thr
35



<210> 28
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<211> 35 
<212> PRT 

<213> Escherichia coli 
<400> 28 

Phe Asp Ala Tyr Lys Glu Ala Ser Asn His Ile Arg Glu Ile Phe Ser
1 5 10 15

Arg Tyr Thr Ser Arg Ile Glu Pro Leu Ser Leu Asp Glu Ala Tyr Leu
20 25 30

Asp Val Thr
35



<210> 29 
<211> 8 
<212> PRT 

<213> Escherichia coli 
<400> 29 

Asp Glu Ala Tyr Leu Asp Val Thr 
1 5 



<210> 30 
<211> 69 
<212> PRT 

<213> Escherichia coli 
<400> 30 

Ser Gly Ala Xaa Leu Ala Ala Gln Leu Arg His Asp Ile Tyr Lys Gln
1 5 10 15

Xaa Arg Leu Thr Ser Ser Val Gly Val Ser Tyr Asn Lys Leu Leu Ala
20 25 30

Lys Leu Gly Ser Xaa Phe Asn Lys Pro Asn Gly Val Thr Val Ile Thr
35 40 45

Xaa Glu Asn Arg Leu Xaa Phe Leu Xaa His Xaa Pro Ile Gly Glu Phe
50 55 60

Arg Gly Val Gly Glu
65



<210> 31 
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11 

<211> 69 
<212> PRT 

<213> Escherichia coli 
<400> 31 

Ser Ala Thr Leu Ile Ala Gln Glu Ile Arg Gln Thr Ile Phe Asn Glu
1 5 10 15

Leu Gln Leu Thr Ala Ser Ala Gly Val Ala Pro Val Lys Phe Leu Ala
20 25 30

Lys Ile Ala Ser Asp Met Asn Lys Pro Asn Gly Gln Phe Val Ile Thr
35 40 45

Pro Ala Glu Val Pro Ala Phe Leu Gln Thr Leu Pro Leu Ala Lys Ile
50 55 60

Pro Gly Val Gly Lys
65



<210> 32 
<211> 5 
<212> PRT 

<213> Escherichia coli 
<400> 32 

Asn Lys Pro Asn Gly 
1 5 



<210> 33 
<211> 10 
<212> PRT 

<213> Escherichia coli 
<400> 33 

Asp Glu Ala Tyr Leu Asp Val Thr Asp Asn 
1 5 10



<210> 34 
<211> 10 
<212> PRT 

<213> Escherichia coli 
<400> 34 

Asp Glu Ala Tyr Leu Asp Val Thr Asp Ser 
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10 



<210> 35 
<211> 9 
<212> PRT 

<213> Escherichia coli 
<400> 35 

Asp Glu Ala Tyr Leu Asp Val Thr Asp 
1 5 



<210> 36 
<211> 65 
<212> PRT 

<213> Escherichia coli 



<400> 36

Ala Ala Gln Leu Arg His Asp Ile Tyr Lys Gln Xaa Arg Leu Thr Ser
1 5 10 15

Ser Val Gly Val Ser Tyr Asn Lys Leu Leu Ala Lys Leu Gly Ser Xaa
20 25 30

Phe Asn Lys Pro Asn Gly Val Thr Val Ile Thr Xaa Glu Asn Arg Leu
35 40 45

Xaa Phe Leu Xaa His Xaa Pro Ile Gly Glu Phe Arg Gly Val Gly Glu
50 55 60

Lys
65



<210> 37 
<211> 65 
<212> PRT 

<213> Escherichia coli 



<400> 37 

Ala Lys Glu Ile Gln Ser Arg Leu Gln Lys Glu Leu Leu Leu Pro Ser
1 5 10 15

Ser Ile Gly Ile Ala Pro Asn Lys Phe Leu Ala Lys Met Ala Ser Asp
20 25 30

Met Lys Lys Pro Leu Gly Ile Thr Ile Leu Arg Lys Arg Gln Val Pro
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35 40 45 

Asp Ile Leu Trp Pro Leu Pro Val Gly Glu Met His Gly Val Gly Lys
50 55 60

Lys 
65 



<210> 38 
<211> 17 
<212> PRT 

<213> Escherichia coli 
<400> 38 



Asp Glu Ala Tyr Leu Asp Val Thr Asp Asn Ala Leu Ser Gly Ala Xaa 
1 5 10 15



Leu 



<210> 39 
<211> 17 
<212> PRT 

<213> Escherichia coli 
<400> 39 



Asp Glu Gly Tyr Met Asp Met Thr Asp Thr Pro Tyr Ser Ser Arg Ala 



Leu 



<210> 40 
<211> 66 
<212> PRT 

<213> Escherichia coli 
<400> 40 

Leu Ala Ala Gln Leu Arg His Asp Ile Tyr Lys Gln Xaa Arg Leu Thr
1 5 10 15

Ser Ser Val Gly Val Ser Tyr Asn Lys Leu Leu Ala Lys Leu Gly Ser
20 25 30

Xaa Phe Asn Lys Pro Asn Gly Val Thr Val Ile Thr Xaa Glu Asn Arg



wo 99/36575 
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35 40 45 

Leu Xaa Phe Leu Xaa Glu Xaa Pro Ile Gly Glu Phe Arg Gly Val Gly
50 55 60 

Glu Lys 
65 



<210> 41 
<211> 66 
<212> PRT 

<213> Escherichia coli 
<400> 41 

Ile Ala Lys Lys Ile Lys Asn Phe Val Phe Gln Asn Leu Arg Ile Lys
1 5 10 15

Ile Ser Ile Gly Ile Ser Asp His Phe Leu Ile Ala Lys Ile Phe Ser
20 25 30

Asn Gln Ala Lys Pro Phe Gly Ile Lys Ser Cys Ser Val Lys Asp Ile
35 40 45

Lys Lys Lys Leu Trp Pro Leu Pro Ile Thr Glu Ile Pro Gly Ile Gly
50 55 60

Glu Lys
65



<210> 42 
<211> 61 
<212> PRT 

<213> Escherichia coli 
<400> 42 

Gln Leu Gly Ile His Ser Ala Met Arg Ser Ala Glu Ala Arg Arg Leu
1 5 10 15

Ala Pro Asp Gly Ile Phe Leu Thr Pro Asp Phe Ala Lys Tyr Lys Ala
20 25 30

Ile Ser Lys Gln Ile His Ala Val Phe Arg Thr Ile Thr Pro Lys Ile
35 40 45

Glu Ala Val Ala Leu Asp Glu Ala Tyr Leu Asp Val Thr
50 55 60
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15 



<210> 43 
<211> 61 
<212> PRT 

<213> Escherichia coli 



<400> 43

Lys Phe Gly Val Arg Ser Ala Met Pro Thr Gly Met Ala Leu Lys Leu
1 5 10 15

Cys Pro His Leu Thr Leu Leu Pro Gly Arg Phe Asp Ala Tyr Lys Glu
20 25 30

Ala Ser Asn His Ile Arg Glu Ile Phe Ser Arg Tyr Thr Ser Arg Ile
35 40 45

Glu Pro Leu Ser Leu Asp Glu Ala Tyr Leu Asp Val Thr
50 55 60



<210> 44 
<211> 9 
<212> PRT 

<213> Escherichia coli 
<400> 44 

Leu Asp Glu Ala Tyr Leu Asp Val Thr 
1 5 



<210> 45 

<211> 39
<212> PRT
<213> Escherichia coli

<400> 45

Gln Leu Arg His Asp Ile Tyr Ile His Thr Arg Leu Leu Phe Gly Gly
1 5 10 15

Cys Ile Val Tyr His Thr Ile Ser Glu Val Gly Ile Phe Asn Lys Pro
20 25 30

Asn Gly Val Thr Val Ile Thr
35



<210> 46 
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<211> 41 
<212> PRT 

<213> Escherichia coli 
<400> 46 

Glu Ile Arg Gln Thr Ile Phe Asn Glu Leu Gln Leu Thr Ala Ser Ala
1 5 10 15

Gly Val Ala Pro Val Lys Phe Leu Ala Lys Ile Ala Ser Asp Met Asn
20 25 30

Lys Pro Asn Gly Gln Phe Val Ile Thr
35 40



<210> 47 
<211> 5 
<212> PRT 

<213> Escherichia coli 
<400> 47 

Asn Lys Pro Asn Gly 
1 5 



<210> 48 
<211> 38 
<212> PRT 

<213> Escherichia coli 
<400> 48 

Ser Gly Ala Leu Leu Ala His Ser Tyr Gly Met Thr Phe Ile Tyr Thr
1 5 10 15

His Asp Tyr Ser Ser Ser Val Gly Val Ser Tyr Thr Ile Leu Leu Ala
20 25 30

Lys Leu Gly Ser Asp Leu
35



<210> 49 
<211> 38 
<212> PRT 

<213> Escherichia coli 
<400> 49 

Ser Ala Thr Leu Ile Ala Gln Glu Ile Arg Gln Thr Ile Phe Asn Glu
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17 

1 5 10 15

Leu Gln Leu Thr Ala Ser Ala Gly Val Ala Pro Val Lys Phe Leu Ala
20 25 30

Lys Ile Ala Ser Asp Met
35



<210> 50 
<211> 68 
<212> PRT 

<213> Escherichia coli
<400> 50 

Gly Ile His Ser Ala Met Arg Ser Ala Glu Ala Arg Arg Leu Ala Pro
1 5 10 15

Asp Gly Ile Phe Leu Thr Pro Asp Phe Ala Lys Tyr Lys Ala Ile Ser
20 25 30

Lys Gln Ile Asx Ala Val Phe Arg Thr Ile Thr Pro Lys Ile Glu Ala
35 40 45

Val Ala Leu Asp Glu Ala Tyr Leu Asp Val Thr Ala Asn Ala Leu Ser
50 55 60

Gly Ala Leu Leu
65



<210> 51 
<211> 68 
<212> PRT 

<213> Escherichia coli 
<400> 51 

Gly Val Lys Thr Thr Met Pro Val Trp Gln Ala Lys Arg His Cys Pro
1 5 10 15

Glu Leu Ile Val Leu Pro Pro Asn Phe Asp Arg Tyr Arg Asn Ser Ser
20 25 30

Arg Ala Met Phe Thr Ile Leu Arg Glu Tyr Thr Asp Leu Val Glu Pro
35 40 45

Val Ser Ile Asp Glu Gly Tyr Met Asp Met Thr Asp Thr Pro Tyr Ser
50 55 60
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Ser Arg Ala Leu 
65 



<210> 52 
<211> 18 
<212> PRT 

<213> Escherichia coli 
<400> 52 

Ser Ser Val Gly Val Ser Tyr Thr Ile Leu Leu Ala Lys Leu Gly Ser
1 5 10 15

Asp Leu 



<210> 53 
<211> 18 
<212> PRT 

<213> Escherichia coli 
<400> 53 

Ser Ser Ile Gly Ile Ala Pro Asn Lys Phe Leu Ala Lys Met Ala Ser
1 5 10 15

Asp Met 



<210> 54 
<211> 61 
<212> PRT 

<213> Escherichia coli 
<400> 54 

Gln Leu Gly Ile His Ser Ala Met Arg Ser Ala Glu Ala Arg Arg Leu
1 5 10 15

Ala Pro Asp Gly Ile Phe Leu Thr Pro Asp Phe Ala Lys Tyr Lys Ala
20 25 30

Ile Ser Lys Gln Ile His Ala Val Phe Arg Thr Ile Thr Pro Lys Ile
35 40 45

Glu Ala Val Ala Leu Asp Glu Ala Tyr Leu Asp Val Thr
50 55 60
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WO-99/36575 




<210> 55 
<211> 61 
<212> PRT 

<213> Escherichia coli 
<400> 55 

Lys Leu Gly Val Lys Ala Gly Met Pro Ile Ile Lys Ala Met Gln Ile
1 5 10 15

Ala Pro Ser Ala Ile Tyr Val Pro Met Arg Lys Pro Ile Tyr Glu Ala
20 25 30

Phe Ser Asn Arg Ile Met Asn Leu Leu Asn Lys His Ala Asp Lys Ile
35 40 45

Glu Val Ala Ser Ile Asp Glu Ala Tyr Leu Asp Val Thr
50 55 60



<210> 56 

<211> 8

<212> PRT 

<213> Escherichia coli 

<400> 56 

Asp Glu Ala Tyr Leu Asp Val Thr 
1 5 



<210> 57 
<211> 43 
<212> PRT 

<213> Escherichia coli 
<400> 57 

Val Thr Ala Asn Ala Leu Ser Gly Ala Leu Leu Ala His Ser Tyr Gly 
1 5 10 15

Met Thr Phe Ile Tyr Thr His Asp Tyr Ser Ser Ser Val Gly Val Ser
20 25 30

Tyr Thr Ile Leu Leu Ala Lys Leu Gly Ser Asp
35 40 



<210> 58 






PCT/US99/01037 



<211> 43 
<212> PRT 

<213> Escherichia coli 



<400> 58

Val Glu Gly Asn Phe Glu Asn Gly Ile Glu Leu Ala Arg Lys Ile Lys
1 5 10 15

Gln Glu Ile Leu Glu Lys Glu Lys Ile Thr Val Thr Val Gly Val Ala
20 25 30

Pro Asn Lys Ile Leu Ala Lys Ile Ile Ala Asp
35 40



<210> 59 
<211> 20 
<212> PRT 

<213> Escherichia coli

<400> 59

Leu Thr Ser Ser Val Gly Val Ser Tyr Asn Lys Leu Leu Ala Lys Leu
1 5 10 15

Gly Ser Asp Leu
20



<210> 60 
<211> 20 
<212> PRT 

<213> Escherichia coli 



<400> 60 



Leu Pro Ser Ser Ile Gly Ile Ala Pro Asn Lys Phe Leu Ala Lys Met
1 5 10 15

Ala Ser Asp Met
20



<210> 61 
<211> 35 
<212> PRT 

<213> Escherichia coli 



<400> 61

Phe Ala Lys Tyr Lys Ala Ile Ser Lys Gln Ile His Ala Val Phe Arg
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10 



15 



Thr Ile Thr Pro Lys Ile Glu Ala Val Val Ile Asp Glu Ala Tyr Leu
20 25 30 



Asp Val Thr 
35 



<210> 62 
<211> 35 
<212> PRT 

<213> Escherichia coli 



<400> 62

Phe Asp Ala Tyr Lys Glu Ala Ser Asn His Ile Arg Glu Ile Phe Ser
1 5 10 15

Arg Tyr Thr Ser Arg Ile Glu Pro Leu Ser Ile Asp Glu Ala Tyr Leu
20 25 30

Asp Val Thr
35



<210> 63 

<211> 8

<212> PRT 

<213> Escherichia coli 
<400> 63 

Asp Glu Ala Tyr Leu Asp Val Thr 
1 5 



<210> 64 
<211> 20 
<212> PRT 

<213> Escherichia coli 



<400> 64 

Leu Thr Ser Ser Val Gly Val Ser Tyr Asn Lys Leu Leu Ala Lys Leu
1 5 10 15

Gly Ser Asp Leu
20
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<210> 65 
<211> 20 
<212> PRT 

<213> Escherichia coli 

<400> 65

Leu Thr Ala Ser Ala Gly Val Ala Pro Val Lys Phe Leu Ala Lys Ile
1 5 10 15

Ala Ser Asp Met
20



<210> 66 
<211> 17 
<212> PRT 

<213> Escherichia coli 



<400> 66

Ile Ser Glu Val Gly Ile Phe Asn Lys Pro Asn Gly Val Thr Val Ile
1 5 10 15

Thr



<210> 67 
<211> 18
<212> PRT

<213> Escherichia coli

<400> 67

Leu Ala Lys Ile Ala Ser Asp Met Asn Lys Pro Asn Gly Gln Phe Val
1 5 10 15

Ile Thr



<210> 68
<211> 5 
<212> PRT 

<213> Escherichia coli 
<400> 68 

Asn Lys Pro Asn Gly

1 5 






<210> 69 
<211> 41 
<212> PRT

<213> Escherichia coli

<400> 69

Ile Phe Leu Thr Pro Asp Phe Ala Lys Tyr Lys Ala Ile Ser Lys Gln
1 5 10 15

Ile His Ala Val Phe Arg Thr Ile Thr Pro Lys Ile Glu Ala Val Val
20 25 30

Ile Asp Glu Ala Tyr Leu Asp Val Thr
35 40



<210> 70
<211> 41
<212> PRT

<213> Escherichia coli

<400> 70

Ile Val Leu Pro Pro Asn Phe Asp Arg Tyr Arg Asn Ser Ser Arg Ala
1 5 10 15

Met Phe Thr Ile Leu Arg Glu Tyr Thr Asp Leu Val Glu Pro Val Ser
20 25 30

Ile Asp Glu Gly Tyr Met Asp Met Thr
35 40



<210> 71 
<211> 11 
<212> PRT 

<213> Oryctolagus cuniculus 



<400> 71 

Lys Phe Ser Arg Glu Lys Lys Ala Ala Lys Thr
1 5 10



<210> 72 
<211> 11 
<212> PRT 

<213> Oryctolagus cuniculus
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::r-"^^= t. 
1 = 



<210> 73 

<211> 15 

<212> PRT 

<213> Oryctolagus cuniculus



<400> 73 Lys Lys Thr 

;..p Leu Lys Glu Glu Lys ^sp He Asn A.. 



<210> 74 
<211> 9 
<212> PRT 

<213> Oryctolagus cuniculus



<400> 74 

Cys Thr Gly Glu Glu Asp Thr Ser Glu 
5 



<210> 75 

<211> 11 

<212> PRT 

<213> Oryctolagus cuniculus



TrrGl^Glu Thr Gin Thr Gin Asp Gin Pro Het 
1 ^ 



<210> 76
<211> 13
<212> PRT

<213> Oryctolagus cuniculus



r:;.%.. -V p» «v 

1 ^ 



<210> 77 
<211> 12 
<212> PRT 
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<213> Oryctolagus cuniculus 

r n'-^v ax. n= ... - 

1 ^ 



<210> 78 
<211> 10 
<212> PRT 

<213> Oryctolagus cuniculus

<400> 78

Gln Arg Ala Asp Ser Leu Ser Ser His Leu



1 5 



<210> 79 
<211> 9 
<212> PRT 

<213> Oryctolagus cuniculus 



<400> 79 

Tyr Pro Tyr Asp Val Pro Asp Tyr Ala 
1 5 



<210> 80 
<211> 10 
<212> PRT

<213> Oryctolagus cuniculus 
<400> 80 

Glu Gln Lys Leu Ile Ser Glu Glu Asp Leu
1 5 10



<210> 81 
<211> 11 
<212> PRT 

<213> Oryctolagus cuniculus 



Tyr Thr Asp Ile Glu Met Asn Lys Leu Gly Lys



1 5 
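
For illustration only, the three-letter residue rows of this Sequence Listing can be converted to the one-letter strings used in the BLASTX alignments of FIGS. 7A-9B, which makes cross-checking a listed sequence against a figure straightforward. The sketch below is not part of the original listing; the function name is hypothetical, and "Xaa", "Asx" and "Glx" are mapped to "X", "B" and "Z" by the usual convention.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Xaa": "X", "Asx": "B", "Glx": "Z",
}


def to_one_letter(residues):
    """Convert a whitespace-separated three-letter row to one-letter code.

    Raises KeyError on a token that is not a recognised residue name,
    which helps flag transcription errors in a listing row.
    """
    return "".join(THREE_TO_ONE[token.capitalize()] for token in residues.split())


# SEQ ID NO: 29 as listed above.
print(to_one_letter("Asp Glu Ala Tyr Leu Asp Val Thr"))
```

For SEQ ID NO: 29 the sketch returns DEAYLDVT, the motif shared by the query and subject sequences in the figures.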
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